Abstract Finding similar entities is a fundamental problem in graph database management and analytics.
Similarity search algorithms usually leverage the structural properties of the database to quantify the
degree of similarity between entities. However, the same information can be represented in many different
structures, and the structural properties observed over particular representations may not hold for alternative
structures. Thus, these algorithms are effective on some representations and ineffective on others. We define
the property of representation independence for similarity search algorithms as their robustness against
transformations that modify the structure of databases and preserve their information content. We introduce
two widespread groups of such transformations called relationship reorganizing and entity rearranging. We
propose an algorithm called R-PathSim, which is provably robust under relationship-reorganizing and a
subset of entity-rearranging transformations. Our empirical results show that the outputs of current algorithms
except for R-PathSim are highly sensitive to the data representation and that R-PathSim is as efficient as,
and as effective as or more effective than, other algorithms.

1 Introduction 

Finding similar or strongly related entities is a fundamental and important problem in graph data management
and analytics. It is a building block of algorithms for various important
database management and analytics problems, such as similarity query processing, pattern query
matching, community detection, and link prediction [20]. Since the properties of similar or
related entities cannot be precisely defined, current similarity and proximity search algorithms use intuitively 
appealing heuristics that leverage information about the links between entities. For instance, Random Walk 
with Restart (RWR) quantifies the degree of similarity or relevance between two entities as the likelihood that a 
random surfer visits one of the entities in the database given it starts and keeps re-starting from the other entity 
[29]. SimRank evaluates the similarity between two entities according to how likely two random surfers are to
meet each other if they start from the two entities [11]. Figure 1a shows fragments of IMDb (imdb.com), which
contains information about movies, actors, and characters. To represent the relationship between a character,
its movie, and the actor who played the character, IMDb connects these entities through edges. Assume
that a user asks for the most similar movie to Star Wars III in Figure 1a. Since the RWR and SimRank scores
of Star Wars V (RWR-score = 0.061, SimRank-score = 0.213) are larger than those of Jumper (RWR-score =
0.060, SimRank-score = 0.185), RWR and SimRank find Star Wars III more similar to Star Wars V than to
Jumper, which is arguably an effective answer.
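
The random-surfer intuition behind RWR can be sketched in a few lines of Python. This is only an illustrative sketch: the toy graph, the restart probability of 0.15, and the fixed number of iterations are assumptions made for the example and are not the data or settings behind the scores reported above.

```python
def random_walk_with_restart(adj, start, restart=0.15, iterations=200):
    """Approximate RWR scores by power iteration.

    adj maps each node to a list of its neighbors (undirected graph in
    which every node has at least one neighbor). The restart probability
    and the iteration count are illustrative assumptions.
    """
    score = {node: 0.0 for node in adj}
    score[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adj}
        for node, neighbors in adj.items():
            # The surfer follows a uniformly chosen edge with prob. 1 - restart.
            share = (1.0 - restart) * score[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        # With probability `restart` the surfer jumps back to the start node.
        nxt[start] += restart
        score = nxt
    return score


# Hypothetical toy graph (not the data of Figure 1): film F1 shares two
# actors with F2 and only one actor with F3.
toy_graph = {
    "F1": ["A1", "A2"],
    "F2": ["A1", "A2"],
    "F3": ["A2"],
    "A1": ["F1", "F2"],
    "A2": ["F1", "F2", "F3"],
}
scores = random_walk_with_restart(toy_graph, "F1")
```

On this toy graph F2 receives a higher score than F3 because it is reachable from F1 through more links, which is exactly the kind of structural signal that depends on how the data is represented.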

The power of similarity search algorithms, however, remains out of the reach of most users as today’s 
similarity search algorithms are usable only by trained data analysts who can predict which algorithms are 
likely to be effective for particular databases. To see why, consider the excerpts of Freebase (freebase.com) in


(a) IMDb (b) Freebase

Figure 1: Fragments of IMDb and Freebase, where A, C, F, and S refer to actor, character, film, and starring,
respectively.

Figure 1b. Figure 1b contains information about exactly the same set of entities and relationships as Figure 1a.
It differs from Figure 1a only in how it represents the relationships between a character, its movie, and its actor:
it connects them to a common node labeled starring. Hence, it contains essentially the same information as
Figure 1a. Database researchers have recognized that different, i.e., non-isomorphic, structures can contain the
same information. As opposed to their results over Figure 1a, RWR and SimRank find Star Wars III more
similar to Jumper (RWR-score = 0.014, SimRank-score = 0.076) than to Star Wars V (RWR-score = 0.011,
SimRank-score = 0.074) in Figure 1b.

Generally, there is no canonical representation for a particular set of content, and people often represent the
same information in different structures [1]. Thus, users have to restructure their databases into some proper
representation(s) to use similarity and proximity search algorithms effectively, i.e., to obtain the answers that a
domain expert would judge as relevant. To make matters worse, these algorithms do not normally offer any
clear description of their desired representations, and users have to rely on their own expertise and/or trial
and error to find such representations. Further, the structure of large-scale databases constantly evolves, and we
want to move away from the need for constant expert attention to keep our algorithms effective.

One approach to solve the problem is to run a similarity search algorithm over all possible representations 
of a data set and select the representation(s) with the most accurate answers. Nevertheless, because most 
similarity algorithms are unsupervised, there is no validating data available to measure the effectiveness of 
these algorithms over various representations. Moreover, it is generally undecidable to compute all possible 
representations of a database [7]. Even if we restrict the set of possible representations, a database may still have
an enormous number of representational variations. For example, the number of vertical decompositions of a
relational table may be exponential in the number of its attributes [1]. Further, as graph databases have less
restrictive schemas than relational databases, they may have more representational variations, and it may take
more time to generate these variations and run algorithms over them. Researchers have proposed the idea of
universal relation to achieve some level of schema independence for SQL queries over relational databases [1].
One may extend this idea and define a universal representation in which all graph databases can be represented
and develop similarity search algorithms that are effective over this representation. Nevertheless, the experience
gained from the idea of universal relation indicates that such a representation may not always exist [1]. Further,
it may not be practical to
force developers to represent their data in and create their algorithms for a particular format. 

In this paper, we propose the property of representation independence for similarity search algorithms over 
graph data, i.e., the ability to deliver the same answers regardless of the choices of structure for organizing 
the data. To the best of our knowledge, the property of representation independence has not previously been 
explored for similarity search algorithms and/or graph databases. We believe that the key to the success 
of building representation independent analytics in general and similarity search algorithms in particular is 
to modify current algorithms to become representation independent instead of developing new representation 
independent algorithms from scratch. Current well-known similarity search algorithms have already been
adopted in both academia and industry to solve various graph analytics problems. Hence, it is easier for
organizations to modify these algorithms than to use new algorithms. They have also been shown empirically
to be effective over some databases, which provides evidence that reasonable modifications of them may be
effective over even more databases. Our contributions are as follows.

• We introduce and formally define the representation independence of a similarity search algorithm as its 
robustness under transformations that modify the structure of its input database but preserve its information 
content. 

• We introduce a widespread group of transformations called relationship-reorganizing transformations that 
modify the representation of relationships between entities in a database. We show that current similarity 
search algorithms are not representation independent under relationship-reorganizing transformations. We 
extend a current similarity search algorithm called PathSim and develop a new algorithm called Robust-PathSim
(R-PathSim for short). We prove that R-PathSim is representation independent under relationship-reorganizing
transformations.

• We introduce another group of common transformations that reposition entities in a database called
entity-rearranging transformations. We show that current similarity search algorithms including R-PathSim are
not representation independent under this family of transformations. We extend R-PathSim and prove that 
its extension is robust under entity-rearranging transformations. 

• We empirically study the representation independence of well-known similarity search algorithms under
relationship-reorganizing and entity-rearranging transformations using several real-world databases and
transformations. Our results indicate that relationship-reorganizing and entity-rearranging transformations
considerably affect the results of all algorithms but R-PathSim. We also empirically evaluate the effectiveness
and efficiency of R-PathSim using real-world databases and show that it is as effective as or more effective than,
and as efficient as, current similarity search algorithms.

This paper is organized as follows. Section 2 provides the background and Section 3 defines the property
of representation independence. Section 4 explores relationship-reorganizing transformations and describes
R-PathSim. Section 5 introduces entity-rearranging transformations and extends R-PathSim. A later section
contains our empirical results.

2 Background 

2.1 Related Work 

The architects of relational models envisioned the desirable property of logical data independence. Oversimplifying
a bit, this meant that an exact query should return the same answers regardless of the schema chosen
for the data. One may consider the idea of representation independence as an extension of the principle
of logical data independence for similarity and proximity search algorithms. Nevertheless, these ideas differ in
an important aspect. One may achieve logical data independence for database applications by creating a set
of views over the database, which keep the application unaffected by modifications in the database schema
[1]. However, characteristics of the ideal representations for similarity and proximity search algorithms are not
clearly defined. Also, graph databases follow far less rigid schemas and are more amenable to change than
relational databases. Hence, it takes far more time and more in-depth expertise to find the proper representation
as well as create and maintain the mapping between the database and this representation. 

Researchers have proposed keyword query interfaces over tree-shaped XML data that return the same answers 
to a keyword query over databases with equivalent content but different choices of structure [25]. We, however, 
introduce and study the concept of representation independence for a different problem and data model. The 
task of similarity and proximity search has different semantics than keyword search and requires different types
of algorithms. Further, graph databases are more complex than tree-shaped XML databases and offer novel 
challenges in defining the concept of representation independence and developing representation independent 
algorithms. 

Researchers have also analyzed the stability of random walk algorithms in graphs against relatively small
perturbations in the data. We also seek to instill robustness in graph mining algorithms, but we
are targeting robustness in a new dimension: robustness in the face of variations in the representation of data.
Researchers have provided systems that help users with transforming and wrangling their data.
We also address the problem of data preparation but using a different approach: eliminating the need to wrangle
the data.

Researchers have proposed several normal forms for relational and tree-shaped XML schemas. Nevertheless,
we focus on finding representation independent similarity search algorithms rather than transforming
the database to a particular representation with some desirable properties. Moreover, because similarity search
algorithms usually operate over graph databases without rigid schemas, our transformations are defined over
much less restrictive schemas than relational schemas or XML DTDs. Our entity-rearranging transformations
somewhat resemble normalization/denormalization in relational and tree-shaped XML databases. Our
transformations, however, modify the connections between entities in the database instead of creating or removing
duplicates. They are also defined over graph databases rather than relational or tree-shaped databases.

Blank nodes represent the existence of resources without any global identifier, i.e., existential variables,
in RDF databases. As blank nodes often convey redundant information, researchers have proposed
methods to remove them from RDF databases. However, our goal in this paper is not to remove
certain nodes from a database. Further, because our databases do not contain any existential variables, we use a
different approach to ensure that our transformations do not modify the information content of a database. For
instance, as opposed to our transformations, the mappings that eliminate blank nodes may not be invertible.
Some serializations of RDF data, such as RDF/XML, may assign labels and identifiers within the scope of
a document to blank nodes in the document. Our framework covers these applications of blank nodes.
Nevertheless, it also addresses the representational shifts over databases that do not contain any blank nodes.
Researchers have proposed algorithms to convert RDF data sets that contain certain relationships in RDF
Schema, such as rdfs:subClassOf, to some normal forms. Our transformations, however, are not limited to
a particular set of relationships.

Schema mapping has been an active research area for the last three decades [1]. In particular, researchers
have defined schema mappings over graph databases as constraints in some graph query language in the context
of data exchange [1]. As opposed to the transformations in our work, the original and transformed databases
in those settings may not represent the same information. We also focus on evaluating the representation 
independence of similarity search algorithms rather than traditional questions in schema mapping and data 
exchange, such as computing the transformed database instances. 
2.2 Data Model


Let dom be a fixed and countably infinite set of values. To simplify our definitions, we assume the members
of dom are strings. Let $L$ be a finite set of labels. Each member of $L$ denotes a semantic type in a domain of
interest, e.g., actor and film in the movie domain. A database $D$ defined over $L$ is a graph
$D = (V, E, \mathcal{L}, \mathcal{A})$, where $V$ is the set of nodes, $E \subseteq V \times V$ is the set of edges,
$\mathcal{L}$ is a total function from $V$ to $L$ that assigns a label to each node, and $\mathcal{A}$ is a function
from $V$ to dom that assigns values to nodes in $V$. We denote the set of all databases whose labels belong to
$L$ as $\mathbf{L}$.
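
The data model above can be mirrored by a small Python structure. This is a minimal sketch under stated assumptions: the class name `Database`, its field names, and the node identifiers `n1` to `n4` are hypothetical, and the fragment below is only written in the style of Figure 1b rather than copied from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    """A database D = (V, E, label, value) over a finite label set."""
    nodes: frozenset  # V
    edges: frozenset  # E, a set of (u, v) pairs with u, v in V
    label: dict       # total function V -> L
    value: dict       # function V -> dom; nodes missing here have no value

    def is_well_formed(self, labels):
        """Check E is over V, every node has a label in L, values are strings."""
        endpoints_ok = all(
            u in self.nodes and v in self.nodes for u, v in self.edges
        )
        labels_ok = all(self.label.get(n) in labels for n in self.nodes)
        values_ok = all(
            n in self.nodes and isinstance(val, str)
            for n, val in self.value.items()
        )
        return endpoints_ok and labels_ok and values_ok

    def entities(self):
        """Entities are the nodes that carry a value."""
        return {n for n in self.nodes if n in self.value}


# Hypothetical fragment: one valueless starring node linked to three entities.
fragment = Database(
    nodes=frozenset({"n1", "n2", "n3", "n4"}),
    edges=frozenset({("n4", "n1"), ("n4", "n2"), ("n4", "n3")}),
    label={"n1": "film", "n2": "actor", "n3": "char", "n4": "starring"},
    value={"n1": "Jumper", "n2": "Bell", "n3": "Griffin"},
)
```

Leaving a node out of `value` is one simple way to model nodes without values; other encodings (for example an explicit null marker) would work equally well.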

Real-world graph databases often contain nodes without any value to represent relationships between or
categorize entities. Figure 1b is an example of using nodes without values to represent relationships
between entities. One may use these types of nodes for several reasons. It is sometimes easier to express
relationships between relationships in a database using nodes without values, e.g., starring [12]. For example,
consider a database that contains information about various types of artists, such as painters. The relationship
paints between a painter and her paintings is a subclass of the relationship creates between an artist and her
creations. To capture the subclass relationship between the relationships paints and creates, one may represent
creates and paints as nodes without any value and connect them by an edge or through another node that represents
the relationship subclass-of. Also, one may use nodes without values to categorize related nodes or express
complex relationships, which helps users understand the structure of the database more easily. For example,
RDF data sets often use nodes without any value and global identifier, i.e., blank nodes, to represent complex
relationships between entities. Empirical studies using 1.23 billion RDF triples and 8.37 million RDF
documents collected from the Web indicate that 30% of RDF triples and 44.9% of RDF documents contain
blank nodes [14]. Following the terminology used in the similarity search literature, we call the nodes with values
entities. We assume that each set of labels $L$ has two mutually exclusive and collectively exhaustive
subsets: $N$, which contains labels for entities, and $R$, which contains labels for nodes without values. That is,
nodes whose labels are in $R$ do not have any value in databases of $\mathbf{L}$.

We denote a similarity query $q$, query for short, over database $D$ as $(v)$, where $v$ is an entity node in $D$.
Query $q = (v)$ seeks entity nodes other than $v$ in $D$ that are similar to $v$. For example,
query (film:Star Wars III) over the database fragment shown in Figure 1b asks for other entities similar or
strongly related to the node film:Star Wars III in the database. Given query $q$ over database $D$, a similarity
search algorithm returns a ranked list of entity nodes from $D$, i.e., the result of $q$. For example, the result of
query (film:Star Wars III) over Figure 1b could be the list of entities film:Star Wars V and film:Jumper. We
denote the result of query $q$ over database $D$ using similarity search algorithm $S$ by $q_S(D)$. If $S$ is clear from
the context, we denote $q_S(D)$ as $q(D)$.

3 Representation Independence 

A representation-independent similarity search algorithm should return the same list of entities for the same 
query across databases that represent the same information. It is important to precisely define the conditions 
under which two graph databases represent the same information. Researchers have defined the conditions under 
which relational or XML schemas represent the same information. Graph databases, however, do not
generally follow strict schemas. Hence, we extend the ideas on comparing information contents of databases for 
our data model. 

A transformation $T$ is a function from a set of databases $\mathbf{L}$ to a set of databases $\mathbf{K}$, denoted as
$T : \mathbf{L} \to \mathbf{K}$. For instance, consider the sets of labels $L_1$ = {actor, film, char} and
$L_2$ = {actor, film, char, starring}. The databases in Figures 1a and 1b belong to $\mathbf{L}_1$ and $\mathbf{L}_2$,
respectively. One may define transformation $T_{\mathrm{IMDb2Freebase}} : \mathbf{L}_1 \to \mathbf{L}_2$, which replaces
every triangle between nodes of labels film, character, and actor with a subgraph whose nodes have the same
labels and values as the nodes in the triangle and are connected to a single new node with the label starring.
This transformation maps the database in Figure 1a to the database in Figure 1b.
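
The triangle-to-star rewriting described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names, the undirected `frozenset` edge encoding, the `starring#k` identifiers, and the three-node example are all assumptions made for the example. A candidate inverse is included only so that the mapping is easy to sanity check.

```python
from itertools import combinations


def imdb_to_freebase(nodes, edges):
    """Replace each film/char/actor triangle by a star around a new starring node.

    nodes maps a node id to (label, value); edges is a set of frozensets.
    """
    new_nodes = dict(nodes)
    new_edges = set(edges)
    triangle_labels = {"film", "char", "actor"}
    counter = 0
    for trio in combinations(sorted(nodes), 3):
        if {nodes[n][0] for n in trio} != triangle_labels:
            continue
        sides = {frozenset(pair) for pair in combinations(trio, 2)}
        if not sides <= edges:
            continue  # the three nodes do not form a triangle
        counter += 1
        hub = f"starring#{counter}"  # hypothetical id for the new valueless node
        new_nodes[hub] = ("starring", None)
        new_edges -= sides
        new_edges |= {frozenset((hub, n)) for n in trio}
    return new_nodes, new_edges


def freebase_to_imdb(nodes, edges):
    """Candidate inverse: replace each starring star by a triangle."""
    old_nodes = {n: lv for n, lv in nodes.items() if lv[0] != "starring"}
    old_edges = {e for e in edges if all(n in old_nodes for n in e)}
    for hub, (label, _) in nodes.items():
        if label != "starring":
            continue
        members = sorted(n for e in edges if hub in e for n in e if n != hub)
        old_edges |= {frozenset(pair) for pair in combinations(members, 2)}
    return old_nodes, old_edges


# Hypothetical one-triangle database in the style of Figure 1a.
imdb_nodes = {
    "f1": ("film", "Jumper"),
    "c1": ("char", "Griffin"),
    "a1": ("actor", "Bell"),
}
imdb_edges = {frozenset(p) for p in [("f1", "c1"), ("c1", "a1"), ("a1", "f1")]}
fb_nodes, fb_edges = imdb_to_freebase(imdb_nodes, imdb_edges)
```

Applying `freebase_to_imdb` to the transformed database returns the original nodes and edges in this example, while the transformed database itself has one more node and a different edge set.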

A transformation $T$ is invertible if a database $D$ is reconstructible from the information in database $T(D)$. For
example, transformation $T_{\mathrm{IMDb2Freebase}}$ is invertible as the original database in Figure 1a can be
reconstructed using the information in its transformed one, i.e., Figure 1b. However, a transformation that removes
the edges between each film node and its neighboring actor and character nodes from Figure 1a is not invertible
because there is insufficient information in the transformed database to recover the relationship between film,
actor, and character nodes. More formally, data graphs $D = (V, E, \mathcal{L}, \mathcal{A})$ and
$D' = (V', E', \mathcal{L}', \mathcal{A}')$ are isomorphic ($D = D'$) iff there is a bijection $f : V \to V'$ such that
(1) $\forall v \in V$, $\mathcal{L}(v) = \mathcal{L}'(f(v))$ and $\mathcal{A}(v) = \mathcal{A}'(f(v))$, and
(2) $\forall u, v \in V$, $(u, v) \in E$ iff $(f(u), f(v)) \in E'$. Isomorphic databases contain exactly the same set of
nodes and connectivity between nodes. In the following, given isomorphic databases $D$ and $D'$, we say that $D$
and $D'$ are the same database. A transformation $T : \mathbf{L} \to \mathbf{K}$ is invertible iff there is a
transformation $T^{-1} : \mathbf{K} \to \mathbf{L}$ such that, for all $D \in \mathbf{L}$, we have $T^{-1}(T(D)) = D$.
Since the transformed database of an invertible transformation contains sufficient information to build the original
database, the original and transformed databases contain essentially the same information. As depicted in
Figure 1, the original and transformed databases of an invertible transformation are not generally isomorphic.
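
The two conditions of the isomorphism definition can be checked directly on very small graphs. The sketch below is a brute-force illustration that tries every bijection, so it is only suitable for toy inputs; the function name, the `(nodes, edges)` tuple encoding, and the example databases are assumptions made for this example.

```python
from itertools import permutations


def are_isomorphic(db1, db2):
    """Brute-force test of the isomorphism definition on tiny graphs.

    Each database is (nodes, edges): nodes maps id -> (label, value) and
    edges is a set of ordered (u, v) pairs.
    """
    nodes1, edges1 = db1
    nodes2, edges2 = db2
    if len(nodes1) != len(nodes2) or len(edges1) != len(edges2):
        return False
    ids1 = sorted(nodes1)
    for image in permutations(sorted(nodes2)):
        f = dict(zip(ids1, image))
        # Condition (1): labels and values are preserved by f.
        if any(nodes1[v] != nodes2[f[v]] for v in ids1):
            continue
        # Condition (2): (u, v) is an edge iff (f(u), f(v)) is an edge.
        if {(f[u], f[v]) for u, v in edges1} == set(edges2):
            return True
    return False


# Hypothetical two-node databases that differ only in their node ids.
db_a = ({1: ("film", "Jumper"), 2: ("actor", "Bell")}, {(1, 2)})
db_b = ({"x": ("actor", "Bell"), "y": ("film", "Jumper")}, {("y", "x")})
# Same shape as db_b but with a different film value.
db_c = ({"x": ("actor", "Bell"), "y": ("film", "Star Wars V")}, {("y", "x")})
```

Here `db_a` and `db_b` are isomorphic because renaming node ids changes neither labels, values, nor connectivity, whereas `db_c` is not isomorphic to `db_a` because one value differs.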

To precisely define representation independence over a transformation $T$, we should make sure that users
can pose the same set of queries over databases $D$ and $T(D)$. Similarity search queries over a database $D$ are
entities of $D$; thus, $D$ and $T(D)$ should essentially contain the same set of entities. Moreover, similarity search
algorithms generally view the labels of nodes as their semantic types. For example, they assume that the
nodes with label film in Figure 1a represent entities from the same semantic type, while the nodes of label film
and actor belong to different semantic types. They use these pieces of information to find similar nodes more
accurately. Thus, for these algorithms to return the same results over a transformation $T$, $T$ should map entities
of the same label in the original database $D$ to entities with the same label in the transformed database $T(D)$.
We consider two data values equal iff they are lexicographically equal: they have the same length and contain 
the same characters in the same positions. Our approach can also support other definitions of equality between 
data values. 

Definition 3.1. Transformation $T : \mathbf{L} \to \mathbf{K}$ that transforms database
$D = (V, E, \mathcal{L}, \mathcal{A})$ to $T(D) = (V_T, E_T, \mathcal{L}_T, \mathcal{A}_T)$ is entity preserving iff
there is a bijective mapping $M$ between the entities in $V$ and $V_T$ such that

• For all entities $v \in V$, we have $\mathcal{A}(v) = \mathcal{A}_T(M(v))$.

• For all entities $v_1, v_2 \in V$ with $\mathcal{L}(v_1) = \mathcal{L}(v_2)$, we have
$\mathcal{L}_T(M(v_1)) = \mathcal{L}_T(M(v_2))$.

For example, transformation $T_{\mathrm{IMDb2Freebase}}$ is entity preserving as it does not introduce any new entity
to or remove any entity from its input databases. An entity preserving transformation $T$ provides a bijective
mapping from every entity of $D$ to an entity of $T(D)$. By abuse of notation, we denote the entity in
database $T(D)$ that is mapped to the entity $v$ in database $D$ as $T(v)$. To simplify our definitions and proofs, we
assume that transformations do not rename the labels in databases. Our results extend to transformations
that rename labels.
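
The two conditions of Definition 3.1 can be checked on a single pair of databases as sketched below. The sketch assumes that entity values are unique within each database, so that the mapping $M$ is forced to send each entity to the entity with the same value; the function name and the example dictionaries are hypothetical. It checks one database and its image, not a whole transformation.

```python
def is_entity_preserving(entities, transformed_entities):
    """Check the two conditions of Definition 3.1 on one database pair.

    Both arguments map an entity value to its label. Using the value as
    the key assumes that values are unique within each database.
    """
    # M maps each entity to the entity with the same value, so M is a
    # bijection exactly when both databases hold the same set of values.
    if set(entities) != set(transformed_entities):
        return False
    # Entities that share a label in D must share a label in T(D).
    image_label = {}
    for value, label in entities.items():
        expected = image_label.setdefault(label, transformed_entities[value])
        if transformed_entities[value] != expected:
            return False
    return True


original = {"Star Wars III": "film", "Jumper": "film", "Bell": "actor"}
renamed = {"Star Wars III": "movie", "Jumper": "movie", "Bell": "actor"}
split = {"Star Wars III": "film", "Jumper": "movie", "Bell": "actor"}
missing = {"Star Wars III": "film", "Jumper": "film"}
```

The `renamed` case passes because the definition only requires same-label entities to stay together, while `split` and `missing` fail the second and first condition, respectively.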

If a transformation is both invertible and entity preserving, it is similarity preserving. Each similarity
preserving transformation $T$ maps a database $D$ to a database $T(D)$ that has the same information and the
same set of possible queries as $D$. It further guarantees that the entities of the same semantic type in $D$ share
the same label in $T(D)$. Hence, it is possible to design an effective similarity search algorithm that returns
essentially the same answers for every query over $D$ and $T(D)$. Because answers of similarity search algorithms
are normally in the form of ranked lists of entities, we define a representation independent similarity search
algorithm as follows.

Definition 3.2. Similarity search algorithm $S$ is representation independent under similarity preserving
transformation $T : \mathbf{L} \to \mathbf{K}$ iff for each database $D \in \mathbf{L}$ with $T(D) \in \mathbf{K}$ and
every query $q$ over $D$, there is a bijective mapping $N$ between $q(D)$ and $T(q)(T(D))$ such that

• for all entities $v \in q(D)$ and $N(v) \in T(q)(T(D))$, we have $N(v) = T(v)$;

• entity $v$ appears before entity $u$ in $q(D)$ iff $N(v)$ ranks before $N(u)$ in $T(q)(T(D))$.

The first condition in Definition 3.2 guarantees that the answers to query $q$ over database $D$ and $T(q)$ over
$T(D)$ contain the same set of entities. Its second condition ensures that these entities appear in the same order
in the results of $q$ and $T(q)$ over $D$ and $T(D)$, respectively. According to Definition 3.2, if answers $v$ and $u$
tie, i.e., are placed at the same position, in $q(D)$, then $T(v)$ and $T(u)$ must also tie in $T(q)(T(D))$.
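
Both conditions of Definition 3.2, including the treatment of ties, can be checked for a single query as sketched below. The sketch assumes that each result is given as a mapping from entity to score, with higher scores ranking earlier; the function name is hypothetical. The example reuses only the two RWR scores per database that are quoted in the Introduction, so it is a partial view of the full ranked lists.

```python
from itertools import combinations


def same_ranking(scores_d, scores_td, t):
    """Check both conditions of Definition 3.2 for one query.

    scores_d / scores_td map each returned entity to its score in D / T(D);
    t maps an entity of D to its counterpart T(v) in T(D).
    """
    # Condition 1: both answers contain the same entities up to T.
    if {t[v] for v in scores_d} != set(scores_td):
        return False
    # Condition 2: v precedes u in q(D) iff T(v) precedes T(u) in T(q)(T(D)).
    for v, u in combinations(scores_d, 2):
        before = scores_d[v] > scores_d[u]
        after = scores_d[v] < scores_d[u]
        t_before = scores_td[t[v]] > scores_td[t[u]]
        t_after = scores_td[t[v]] < scores_td[t[u]]
        if before != t_before or after != t_after:
            return False  # order flipped, or a tie on one side only
    return True


# RWR scores for query (film:Star Wars III) as quoted in the Introduction.
rwr_imdb = {"film:Star Wars V": 0.061, "film:Jumper": 0.060}
rwr_freebase = {"film:Star Wars V": 0.011, "film:Jumper": 0.014}
identity = {entity: entity for entity in rwr_imdb}
rwr_is_independent = same_ranking(rwr_imdb, rwr_freebase, identity)
```

For these scores the check fails, since the order of Star Wars V and Jumper flips between the two representations. Note that only the order matters: two results with different score magnitudes but the same order and the same ties satisfy the definition.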

The result of a query is a list of entities, where each entity is shown by its semantic type and value. A
database may have several entities with equal values of the same semantic type. Hence, it may not be
possible to check the first condition of Definition 3.2 using only the semantic types and values of the entities
in the results of a query. One may assign a unique (printable) id to each entity in the database to address this
problem [7]. To simplify our framework and definitions, we assume that databases do not contain entities that
belong to the same semantic type and have equal values. Our results extend to other cases.

4 Relationship Reorganization 

4.1 Relationship-Reorganizing Transformations 

Generally speaking, a relationship-reorganizing transformation $T$ maps database $D$ to database $T(D)$ such that
$D$ and $T(D)$ contain the same set of entities and relationships, but they may represent these relationships in
different forms. More specifically, $D$ and $T(D)$ may express the same relationship between the same set of
entities using some edges or some nodes without values. For example, Figure 2a uses a set of edges to represent
the relationship between a movie and its actors. However, Figure 2b expresses the same relationship between
the same set of entities by a node without value, i.e., actors. In this section, we formally define this type of
representational variation. First, we find patterns that represent relationships between entities in a database.
Then, we define the conditions under which two patterns represent the same information. Finally, we define
a relationship-reorganizing transformation as a bijective mapping between patterns that represent the same
information in the original and transformed databases.

A walk in a database is a sequence of nodes and edges where each edge's endpoints are the preceding and
following nodes in the sequence. We show a walk in database $D$ as a sequence of nodes $[v_0, \ldots, v_n]$ such that
$v_i$ are nodes and $(v_{i-1}, v_i)$, $0 < i \le n$, are edges in $D$. For example, $w_1$ = [actor:Ford, actors,
film:Star Wars V] is a walk in Figure 2b. Intuitively, a walk represents some relationship between its entities.
For example, walk $w_1$ in Figure 2b shows that actor Ford has played in movie Star Wars V. One may use paths to
capture relationships between entities in a database. However, we show later that walks represent more varieties
of relationships than paths, which enables us to achieve representation independence over more transformations.
To simplify our framework, we assume that each database is a simple graph: it has at most one edge between
any two nodes and does not have any self-loops. Our framework extends to other cases. We are
interested in walks that express relationships between entities. Hence, we consider only walks that start and
end with entities.

Some walks contain consecutive forward and backward traverses from an entity to a node without value.
For example, walk [actor:Ford, actors, film:Star Wars V, actors, film:Star Wars V] in Figure 2b expresses the
relationship between actor Ford and movie Star Wars V. It contains consecutive forward and backward traverses
from entity film:Star Wars V to the node without value actors. The information expressed by this walk can be
represented using a shorter walk [actor:Ford, actors, film:Star Wars V], which does not contain any consecutive
forward and backward traverses from film:Star Wars V to actors. Another example of such walks in Figure 2b
is [film:Star Wars V, actors, film:Star Wars V]. This walk does not provide any information regarding the
relationships between entities in the database. Hence, unless otherwise noted, we consider only walks that do
not have any consecutive forward and backward traverses from an entity to a node without value because they
do not contain any information regarding the relationship between entities or their information can be expressed
by shorter walks.
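
The restrictions on walks described above can be expressed as a small filter. The sketch below assumes undirected edges stored as frozensets and reads "consecutive forward and backward traverses" as the pattern entity, valueless node, same entity; the function names, the node id `actors#1`, and the two-edge fragment in the style of Figure 2b are hypothetical.

```python
def is_walk(walk, edges):
    """True if every pair of consecutive nodes in `walk` is joined by an edge."""
    return all(frozenset(pair) in edges for pair in zip(walk, walk[1:]))


def is_informative_walk(walk, edges, entities):
    """Apply the restrictions described above to a candidate walk."""
    if not is_walk(walk, edges):
        return False
    # Only walks that start and end with entities are of interest.
    if walk[0] not in entities or walk[-1] not in entities:
        return False
    # Reject entity -> valueless node -> same entity (forward then backward).
    for first, middle, last in zip(walk, walk[1:], walk[2:]):
        if first == last and first in entities and middle not in entities:
            return False
    return True


# Hypothetical fragment: one valueless actors node joins an actor and a film.
entities = {"actor:Ford", "film:Star Wars V"}
edges = {
    frozenset(("actor:Ford", "actors#1")),
    frozenset(("actors#1", "film:Star Wars V")),
}
short = ["actor:Ford", "actors#1", "film:Star Wars V"]
long = short + ["actors#1", "film:Star Wars V"]
loop = ["film:Star Wars V", "actors#1", "film:Star Wars V"]
```

With this filter, `short` is kept while `long` and `loop` are discarded, matching the three examples in the text.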

The meta-walk of a walk $[v_1, \ldots, v_n]$ in database $D = (V, E, \mathcal{L}, \mathcal{A})$ is the sequence of labels
$[\mathcal{L}(v_1), \ldots, \mathcal{L}(v_n)]$. For example, the meta-walk of walk [actor:Ford, actors, film:Star Wars V]
in Figure 2b is [actor, actors, film]. Each meta-walk represents a pattern of relationship between entities of
certain semantic types. Some meta-walks represent basically the same relationships between the same sets of
semantic types. For instance, meta-walk [actor, film] in Figure 2a and meta-walk [actor, actors, film] in
Figure 2b represent the relationship of starring in movies between the same set of actors and movies. Next, we
define the conditions under which two meta-walks represent the same relationship between the same set of
entities. Given database $D = (V, E, \mathcal{L}, \mathcal{A})$, the value of an entity node $v \in V$ is the pair
$\mathcal{L}(v) : \mathcal{A}(v)$. The value of a walk $w = [v_0, \ldots, v_n]$ is the tuple $[a_0, \ldots, a_m]$,
$m \le n$, such that $a_0$ and $a_m$ are the values of $v_0$ and $v_n$, respectively, and for all $0 \le i < j \le n$ and
$0 \le i', j' \le m$, if $a_{i'}$ and $a_{j'}$ are the values of entity nodes $v_i$ and $v_j$, respectively, then
$i' < j'$. For instance, the value of walk [actor:Ford, actors, film:Star Wars V] is [actor:Ford, film:Star Wars V].
Values of two walks are equal iff they have equal arities and their corresponding positions contain the same
label and equal values. Two walks are content equivalent iff their values are equal. For instance, walk
[actor:Ford, film:Star Wars V] in Figure 2a and walk [actor:Ford, actors, film:Star Wars V] in Figure 2b are
content equivalent. We show content-equivalent walks $w$ and $x$ as $w \equiv x$. Let $p(D)$ denote the set of walks
in database $D$ whose meta-walk is $p$.
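
The notions of meta-walk, walk value, and content equivalence of walks translate almost directly into code. The sketch below is illustrative: the function names, the node ids `a`, `r`, and `f`, and the two tiny label and value maps (written in the style of Figures 2a and 2b) are assumptions made for this example.

```python
def meta_walk(walk, label):
    """Sequence of labels of the nodes visited by the walk."""
    return [label[node] for node in walk]


def walk_value(walk, label, value):
    """Values (label:value pairs) of the entity nodes of the walk, in order."""
    return [f"{label[n]}:{value[n]}" for n in walk if n in value]


def content_equivalent(walk1, label1, value1, walk2, label2, value2):
    """Two walks are content equivalent iff their values are equal."""
    return walk_value(walk1, label1, value1) == walk_value(walk2, label2, value2)


# Style of Figure 2a: the actor is linked to the film directly.
label_a = {"a": "actor", "f": "film"}
value_a = {"a": "Ford", "f": "Star Wars V"}
# Style of Figure 2b: a valueless node `r` labeled actors sits in between.
label_b = {"a": "actor", "r": "actors", "f": "film"}
value_b = {"a": "Ford", "f": "Star Wars V"}

walk_in_a = ["a", "f"]
walk_in_b = ["a", "r", "f"]
```

Skipping nodes that have no value is what makes the two walks comparable even though they have different lengths.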

Definition 4.1. Meta-walks $p_1$ in database $D_1$ and $p_2$ in database $D_2$ are content equivalent iff there is a
bijection $M : p_1(D_1) \to p_2(D_2)$ where for all $w \in p_1(D_1)$, $w \equiv M(w)$.

Meta-walks [actor, film] in Figure 2a and [actor, actors, film] in Figure 2b are content equivalent. We denote
content-equivalent meta-walks $p_1$ and $p_2$ as $p_1 \equiv p_2$.
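
A bijection that pairs every walk with a content-equivalent walk exists exactly when the two meta-walks have the same multiset of walk values, which gives a simple way to test Definition 4.1 on small examples. In the sketch below the function name and the two lists of walk values are hypothetical; the values are assumed to have been computed beforehand, one list per meta-walk.

```python
from collections import Counter


def meta_walks_content_equivalent(walk_values_1, walk_values_2):
    """Test Definition 4.1 by comparing multisets of walk values.

    walk_values_i lists the value of every walk of meta-walk p_i in D_i.
    """
    return Counter(map(tuple, walk_values_1)) == Counter(map(tuple, walk_values_2))


# Hypothetical values of the walks of [actor, film] in one database ...
values_actor_film = [
    ["actor:Ford", "film:Star Wars V"],
    ["actor:Oz", "film:Star Wars V"],
]
# ... and of the walks of [actor, actors, film] in its reorganized version.
values_actor_actors_film = [
    ["actor:Oz", "film:Star Wars V"],
    ["actor:Ford", "film:Star Wars V"],
]
```

Using a multiset rather than a set matters: if one database had two walks with the same value and the other had only one, no bijection could exist.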

Naturally, content-equivalent meta-walks represent the same sets of relationships between the same sets of
entities. Thus, if a transformation bijectively maps each meta-walk in database $D_1$ to its content-equivalent
meta-walk in database $D_2$, then $D_1$ and $D_2$ represent the same information. We formally prove this intuition later
in the section. However, this straightforward definition ignores some interesting transformations. For example,
intuitively the databases in Figure 2a and Figure 2b contain the same information. But there is no
meta-walk in Figure 2a that is content equivalent to meta-walk $p_3$ = [actor, actors, actor] in Figure 2b. By looking
closely at Figure 2b and the original Movielicious data, we observe that the node actors always groups actors
that play in the same movie. Thus, each walk of $p_3$ is a part of a walk of meta-walk $p_4$ = [actor, actors, film,
actors, actor] in Figure 2b. Hence, if a transformation maps $p_4$ to a content-equivalent meta-walk in Figure 2a, it
also preserves the information of $p_3$. Generally, some meta-walks contain other meta-walks. If a transformation
preserves the information of a meta-walk, it will preserve the information of its contained meta-walks. Let us
formalize this relationship between meta-walks. A walk w is a subwalk of walk x, shown as $w \sqsubseteq x$, iff w is a
subsequence of x. For example, walk $[v_1, v_2, v_3]$ is a subwalk of walk $x_1 = [v_1, v_2, v_4, v_2, v_3]$. But walk $[v_1, v_3]$
is not a subwalk of $x_1$ because the edge $(v_1, v_3)$ is not in $x_1$. Meta-walk p is a subwalk of meta-walk r, denoted
as $p \sqsubseteq r$, iff a walk of p is a subwalk of a walk of r. For example, [actor, actors, actor] is a subwalk of [actor,
actors, film, actors, actor] in Figure 2b.
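
The subwalk relation on walks can be sketched in Python as follows. This is one reading of the definition, chosen to agree with the two examples above: the nodes of w must appear in x in order, and every consecutive pair of nodes in w must be an edge that x itself traverses. Treating edges as undirected and the function name `is_subwalk` are assumptions of this sketch.

```python
from typing import Hashable, Sequence


def is_subwalk(w: Sequence[Hashable], x: Sequence[Hashable]) -> bool:
    """Check whether walk w is a subwalk of walk x (both given as node sequences)."""
    # Edges traversed by x, treated as undirected (an assumption of this sketch).
    edges_of_x = {frozenset(pair) for pair in zip(x, x[1:])}

    # Subsequence test: each node of w must be found in x after the previous match.
    remaining = iter(x)
    in_order = all(node in remaining for node in w)

    # Every step of w must follow an edge that x also follows.
    steps_ok = all(frozenset(pair) in edges_of_x for pair in zip(w, w[1:]))
    return in_order and steps_ok


x1 = ["v1", "v2", "v4", "v2", "v3"]
```

With this reading, `["v1", "v2", "v3"]` is accepted as a subwalk of `x1`, while `["v1", "v3"]` is rejected because `x1` never traverses the edge between `v1` and `v3`.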

Definition 4.2. Given meta-walks p and p' in database D, p' includes p iff

• there is a bijection M between p(D) and p'(D) such that for every walk $w \in p(D)$, we have $w \sqsubseteq M(w)$, and
w and M(w) start at the same node and end at the same node.

• there exists an entity label l whose occurrence in p' is more than in p, and the closest entity labels to the left
and to the right of l in p' are not l.

For example, meta-walk [actor, actors, film, actors, actor] includes [actor, actors, actor] in Figure 2b. A
meta-walk p in database D is maximal iff it has a walk in D and it is not included in any other meta-walk.
For instance, [actor, actors, film, actors, actor] is maximal in Figure 2a. Maximal meta-walks subsume the
information of non-maximal meta-walks. Thus, if a transformation preserves only the information of maximal
meta-walks in a database, it will preserve the information content of the database. Let $\mathcal{P}(L)$ denote the set
of all meta-walks in the set of databases L. Similarly, we denote the set of all maximal meta-walks in L as
$\mathcal{P}_{\max}(L)$.

Definition 4.3. Transformation $T : L \to K$ is relationship reorganizing iff there is a bijective mapping
$M : \mathcal{P}_{\max}(L) \to \mathcal{P}_{\max}(K)$ such that $p = M(p)$.

The transformations between the databases in Figure 2b and Figure 2a and between the two databases in Figure 1 are relationship reorganizing.
Theorem 4.4. Every relationship-reorganizing transformation is similarity preserving. 

Proof. Let $T : L \to K$ be a relationship-reorganizing transformation and $M : \mathcal{P}_{\max}(L) \to \mathcal{P}_{\max}(K)$ be the
bijection that T establishes between maximal meta-walks in L and K. Let $M^{-1}$ be the inverse of M. Let us
define T' to be a transformation from K to L as follows. Because M is bijective, for each $D' \in K$, define T' to
bijectively map a maximal meta-walk p' in D' to a maximal meta-walk $M^{-1}(p')$ in T'(D') s.t. $p' = M^{-1}(p')$.
Since, as assumed earlier, we do not distinguish between isomorphic databases, we show that, $\forall D \in L$,
T'(T(D)) and D are the same.

For each maximal meta-walk p in D, there exists exactly one maximal meta-walk M(p) in T(D) s.t. $p = M(p)$.
For each maximal meta-walk M(p) in T(D), there exists exactly one maximal meta-walk $M^{-1}(M(p))$ in
T'(T(D)) s.t. $M^{-1}(M(p)) = M(p)$. Thus, $p = M^{-1}(M(p))$. For each $w \in p(D)$, there exists exactly one walk
$w' \in M^{-1}(M(p))(T'(T(D)))$ s.t. w = w'. Because $M^{-1}$ is the inverse of M, $M^{-1}(M(p)) = p$, and so w' = w.
Hence, there exists a bijection that maps a walk of some maximal meta-walk in T'(T(D)) to the same walk in
D. Therefore, the sets of walks of maximal meta-walks in D and T'(T(D)) are the same.

We show that the set of all nodes that appear in a walk of some maximal meta-walk is the set of all nodes 
in the database. Consider that a node v in a database must appear in a walk of some meta-walk p. If p is
not maximal, then p is included in some maximal meta-walk p'. That is, v appears in a walk of some maximal 
meta-walk in the database. Thus, the set of all nodes that appear in a walk of some maximal meta-walk is the 
same as the set of all nodes in the database. Using similar arguments, we prove that the set of all edges that 
appear in a walk of some maximal meta-walk in a database is the set of all edges in the database. Since the 
sets of walks of maximal meta-walks in D and T'{T{D)) are the same, the sets of nodes and edges in D and 
T'{T{D)) are also the same. Hence, D and T'{T{D)) are the same. Therefore, T is invertible. 

For each entity e in a database $D \in L$, assume e appears in a walk w of some maximal meta-walk in D.
By the definition of relationship-reorganizing transformations, T bijectively maps w to a walk w' of some
maximal meta-walk in T(D) s.t. w = w'. Thus, there must exist an entity in T(D) with the same label and value
as e. Similarly, if an entity f in T(D) exists, then f also exists in D. Hence, T is entity preserving.
Therefore, T is similarity preserving. □

4.2 Toward Robust Similarity Algorithms

To the best of our knowledge, the most frequently used methods for similarity search on graph databases are based
on random walk, e.g., RWR, pairwise random walk, e.g., SimRank and P-Rank [35], or relationship-constrained
frameworks, e.g., PathSim. There are other similarity measures, such as common neighbors,
Katz$_\beta$ measure, hitting time, and commute time, which can be considered as special cases of the aforementioned
heuristics. Hence, we discuss similarity search methods based on these three frameworks.

Methods that use random walk and pairwise random walks leverage the topology of a graph database
to measure the degree of similarity between entities. A relationship-reorganizing transformation may remove
many edges from and add many new nodes and edges to a database. Thus, it may radically modify the database
topology. For example, a relationship-reorganizing transformation may drastically change the degree of a node
and modify the probability that random surfers visit the node. Hence, these methods cannot always return the
same answers over the original and the transformed database for the same query. In Section 1, we have shown
that RWR and SimRank return different results over a database and its relationship reorganization in Figure 1.

PathSim measures the similarity between entities over a given relationship. For example, it may compute
the similarity of two movies in a movie database based on their common actors. PathSim uses meta-walks to
represent relationships between entities. For instance, the relationship between two movies in Figure 1b based
on their common actors is expressed by [film, actors, actor, actors, film]. Let p(e, f, D) be the set of walks of
meta-walk p from entity e to entity f in database D. PathSim measures the similarity between e and f according to
the input meta-walk p as $s(e,f) = \frac{2\,|p(e,f,D)|}{|p(e,e,D)| + |p(f,f,D)|}$. PathSim considers walks with and without consecutive
forward and backward traverses from an entity to a node without value when it computes s(e, f).



(a) DBLP-citation (b) SNAP 

Figure 3: Fragments of two citation databases 


PathSim may return different answers for the same queries over the same relationship on a database and its
relationship reorganizations. Figure 3 shows fragments of DBLP from dblp.uni-trier.de, called DBLP-citation,
and SNAP from snap.stanford.edu that contain information about citations. Consider the meta-walk s = [paper, cite, paper,
cite, paper] in Figure 3a and its corresponding meta-walk s' = [paper, paper, paper] in Figure 3b. s has a walk
between entities p3 and p4, x = [paper:p3, cite, paper:p4, cite, paper:p4]. But there is no corresponding walk of
meta-walk s' between p3 and p4 in Figure 3b. Hence, PathSim reports p1 to be more similar to p2 than p3 in
Figure 3a but considers p1 to be more similar to p3 than p2 in Figure 3b. PathSim returns different answers
because it considers walks with consecutive forward and backward traverses from an entity to a node without
value, such as x.

From here onward, we call a walk with consecutive forward and backward traverses from an entity to a node
without value non-informative, and informative otherwise. As discussed earlier, non-informative
walks either do not provide any information about the relationship between entities or their information can
be represented by a shorter walk. Figure 3a and Figure 3b show that non-informative walks may be present
in a database but be absent from its relationship-reorganizing transformations. Hence, if we modify PathSim
so that it computes similarity scores using only informative walks, it will be representation independent under
relationship-reorganizing transformations. Using Definitions 4.2 and 4.3, we have the following theorem.

Theorem 4.5. Let $T : L \to K$ be a relationship-reorganizing transformation and p be a meta-walk in $D \in L$.
There is a meta-walk r in T(D) such that for each pair of entities e and f in D, we have
$|p(e,f,D)| = |r(T(e),T(f),T(D))|$.

Proof. Suppose p is maximal. According to Definition 4.3, there is a maximal meta-walk T(p) in T(D) s.t.
$p = T(p)$. Because there is a bijection that maps each informative walk of p to an informative walk of T(p) with
equal value, we have $|p(e,f,D)| = |T(p)(T(e),T(f),T(D))|$. If p is not maximal, according to Definition 4.2, we
can find a maximal meta-walk p' in D that includes p s.t. $|p(e,f,D)| = |p'(e,f,D)|$. Using similar arguments to
when p is maximal, there exists a maximal meta-walk T(p') in T(D) s.t. $|T(p')(T(e),T(f),T(D))| = |p'(e,f,D)|$.
Hence, $|p(e,f,D)| = |T(p')(T(e),T(f),T(D))|$. □

Given entities e and f and a meta-walk p in database D and their corresponding entities and meta-walk in
T(D), namely T(e), T(f), and r, the numerator and denominator of s(e, f) will be respectively equal to the numerator
and denominator of s(T(e), T(f)). Hence, the modification of PathSim will return equal similarity scores for
queries over a database and its relationship-reorganizing transformation. We call this extension of PathSim
Robust-PathSim (R-PathSim).

Let us discuss why we can modify PathSim to create a representation-independent algorithm and whether it 
is possible to extend other algorithms, such as RWR and SimRank, and make them representation independent. 
Relationship-reorganizing transformations do not add any relationship to or remove any relationship from a 
database. Because R-PathSim quantifies the amount of similarity separately for each type of relationship 
between two entities, it can return equal scores over a database and its relationship-reorganizing transformations. 
R-PathSim also leverages the concept of meta-walk to detect and ignore the spurious walks in each meta-walk 
that may not be present in some representations of the database. RWR and SimRank do not compute the 
similarity between entities based on a given relationship. One may define RWR or SimRank scores between 
two entities for a given meta-walk [25]. Also, we can modify RWR and SimRank to ignore the non-informative
walks. Analyses of such extensions are interesting subjects of future work. 

The computation of R-PathSim is similar to that of PathSim [53] with extra steps of detecting and
ignoring non-informative walks. The commuting matrix of meta-walk $p = [l_1, \ldots, l_k]$ in database D is
$M_p = A_{l_1 l_2} A_{l_2 l_3} \cdots A_{l_{k-1} l_k}$, where $A_{l_i l_j}$ is the adjacency matrix between nodes of labels $l_i$ and $l_j$ in D. Each entry
$M_p(i,j)$ represents the number of walks between entities $i \in l_1(D)$ and $j \in l_k(D)$. Given commuting matrix
$M_p$, we can compute the PathSim score between i and j as $\frac{2 M_p(i,j)}{M_p(i,i) + M_p(j,j)}$. However, R-PathSim uses only the
informative walks. A meta-walk whose walks may not be informative is of the form
$p = [l_1, \ldots, l_i, x_{m_1}, \ldots, x_{m_j}, l_i, \ldots, l_k]$, $1 < i < k$, where the $l_i$s are entity labels and $x_{m_1}, \ldots, x_{m_j}$ are labels of nodes without values.

Meta-walk p may have non-informative walks because it contains meta-walks $s_i = [l_i, x_{m_1}, \ldots, x_{m_j}, l_i]$. Let $M_{s_i}$
be the commuting matrix of $s_i$. The diagonal entries in $M_{s_i}$ contain the number of non-informative walks of
$s_i$. Let $M^d_{s_i}$ denote the diagonal matrix of $M_{s_i}$. Matrix $M_{s_i} - M^d_{s_i}$ contains the number of informative walks of
$s_i$. To compute the number of informative walks of meta-walk p, we first find subwalks of p that start and end
with the same entity label and whose remaining labels are non-entity labels. We call this set of meta-walks S and
denote the rest of the subwalks of p as R. The number of informative walks of p between each pair of entities in
D is $M^*_p = \prod_{s \in S} (M_s - M^d_s) \prod_{r \in R} M_r$.
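
The counting step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: matrices are nested lists, the toy adjacency matrix `A_paper_cite` is hypothetical data (three papers, two value-less cite nodes), and the helper names `matmul`, `transpose`, `drop_diagonal`, and `pathsim` are made up for this example.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain integer matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def drop_diagonal(m: Matrix) -> Matrix:
    """Compute M - M^d: zero the diagonal, i.e. walks that return to their start entity."""
    return [[0 if i == j else v for j, v in enumerate(row)] for i, row in enumerate(m)]


def pathsim(m: Matrix, i: int, j: int) -> float:
    """PathSim score 2*M(i,j) / (M(i,i) + M(j,j)); returns 0.0 when the denominator is 0."""
    denom = m[i][i] + m[j][j]
    return 2 * m[i][j] / denom if denom else 0.0


# Hypothetical toy data: rows are papers p1..p3, columns are cite nodes c1, c2.
# c1 links p1 and p2; c2 links p2 and p3.
A_paper_cite = [
    [1, 0],
    [1, 1],
    [0, 1],
]

# Commuting matrix of s = [paper, cite, paper].
M_s = matmul(A_paper_cite, transpose(A_paper_cite))
# Informative walks of s only: the diagonal (paper -> cite -> same paper) is removed.
M_s_informative = drop_diagonal(M_s)

# p = [paper, cite, paper, cite, paper] is s followed by s.
M_p = matmul(M_s, M_s)
M_p_informative = matmul(M_s_informative, M_s_informative)
```

On this toy input, `pathsim(M_p, 0, 1)` uses all walks, whereas `pathsim(M_p_informative, 0, 1)` uses only the walks that survive the diagonal removal, so the two can differ for the same pair of papers.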

It may take a long time to compute the commuting matrix of a relatively long meta-walk at query time.
Also, it is not feasible to precompute and store the commuting matrices for every possible meta-walk. PathSim
precomputes commuting matrices for relatively short meta-walks. Then, PathSim concatenates them at
query time to get the commuting matrix of a longer meta-walk. This approach efficiently computes PathSim
scores. We follow the same method to compute R-PathSim scores efficiently.

Users may not know the structure of the database and cannot supply any input meta-walk. One may solve
this problem by computing the (weighted) average of similarity scores over maximal meta-walks between entities
[25]. Definition 4.3 provides that there is a bijection between all maximal meta-walks in a database and its
relationship-reorganizing transformation. Also, Theorem 4.5 guarantees that R-PathSim returns equal scores
for each maximal meta-walk over a database and its transformation. Hence, the combined similarity scores are
equal in the original and transformed databases.

In order to find a set of maximal meta-walks, we first find a set of meta-walks in the database. Then we
check whether the meta-walks found are maximal. Algorithm 1 provides a framework for checking whether
a given meta-walk is maximal. The underlying idea is that, if a meta-walk p is not maximal, there exists a
meta-walk p' that includes p. That is, p' must contain an additional entity label to p. Using Definition 4.2, we
check whether each walk of p is a subwalk of exactly one walk of p'. If there is a p' that includes p, then p is not
maximal. Otherwise, p is maximal. The running time of Algorithm 1 is $O(nd^2m)$ where n is the size of a given
meta-walk p, d is the average degree of nodes, and m is the number of walks of p in the database.

For further optimization, if there are many meta-walks between the query node and the candidate answers
in the database, one may save processing time by limiting the set of meta-walks over which the aggregated score
is computed. One may do so by selecting the maximal meta-walks $p = r r^{-1}$, where $r^{-1}$ is the meta-walk that
is the reverse of r, such that r contains only distinct entity labels and only a given number of entity labels.
Definition 4.3 guarantees that, for each maximal meta-walk r, there is exactly one maximal meta-walk r' in the
transformed database such that r = r'. Further, the number of entity labels of r and r' must be the same. That
is, if p is used over the database, then $p' = r' r'^{-1}$ is also used over its transformation. Similar to Theorem 4.5,
we have that the R-PathSim scores computed using p over the database and using p' over its transformation
are equal. Therefore, the aggregated R-PathSim scores computed over these sets are equal across the original
database and its transformations.

Algorithm 1: Check a meta-walk if it is maximal
Input: Database D = (V, E, L, A), Meta-walk p = [l_1, ..., l_n]
Output: ACCEPT if p is maximal, or REJECT if p is not maximal

foreach i = 2 ... n − 1 do
    S_i ← set of all meta-walks [l_i, l] or [l_i, l', l] in D where l is an entity label and l' is not an entity label
    foreach r ∈ S_i do
        foreach w = [v_1, ..., v_n] ∈ p(D) do
            if there exists no walk or more than one walk in r(D) from v_i then
                /* Assume p' = [l_1, ..., l_i] r r^{-1} [l_i, ..., l_n] where p ⊑ p'. p' does not include p. */
                Go to process next r ∈ S_i
            end
        end
        /* Each walk of p is a subwalk of exactly one walk of p'. Hence, p' includes p. */
        return REJECT
    end
end
return ACCEPT


5 Entity Rearrangement 

5.1 Entity-Rearranging Transformation 

Different databases may represent the same relationship between a set of entities by connecting them using
different sets of edges. Consider Figure 4, which shows the original and an alternative representation for Microsoft
Academic Search (academic.research.microsoft.com) (MAS for short) data. Both databases contain entities
of semantic types paper, conference, domain, and keyword, which are labeled as paper, conf, dom, and kw,
respectively. The domains of papers and conferences show their areas, e.g., database and data mining. The
keyword entities contain the keywords of domains, e.g., indexing for the database domain. Each paper is published
in only one conference and each conference belongs to only one domain. The database in Figure 4a expresses
the relationship between a paper and its conference and domain by connecting each paper to both its conference
and its domain. On the other hand, the database in Figure 4b represents the same relationship by connecting
each paper to its conference and connecting each conference to its domain. We call this representational shift
that rearranges entities in a database an entity-rearranging transformation.


(a) Original representation (b) Alternative representation

Figure 4: Fragments of some representations for MAS data with FDs paper $\to$ conf and conf $\to$ dom.

Because each paper in Figure 4b has only one conference and each conference has only one domain, we can
switch the relative position of paper and conference and get Figure 4a without losing or adding any relationships
to the ones represented in Figure 4b. Assume that a paper can be published in multiple conferences from different
domains in a database that follows the representation of Figure 4b. If we rearrange the positions of papers,
conferences, and domains in this database according to the representation in Figure 4a, each conference of a
paper will be connected to all domains of every conference in which the paper is published. Hence, we will add
new relationships between conferences and domains that are not available in the original database. Also, we
will not be able to recover the original set of relationships between a conference and its domain in the original
database. Hence, an entity-rearranging transformation preserves the information of a database if certain entities
in the database satisfy some dependencies. The following definition formalizes these dependencies. Let l(D)
denote all nodes in database D with label l.

Definition 5.1. Given meta-walk $p = [l_1, \ldots, l_n]$ in the set of databases L, L satisfies functional dependency
(FD) $l_1 \xrightarrow{p} l_n$ iff for every $D \in L$, if walks $[e, \ldots, f]$ and $[e, \ldots, g]$ of meta-walk p are in D, then f = g.

For example, the FDs in Figure 4a are paper $\xrightarrow{p_1}$ conf, paper $\xrightarrow{p_2}$ dom, and conf $\xrightarrow{p_3}$ dom, where $p_1$ = [paper, conf],
$p_2$ = [paper, conf, dom], and $p_3$ = [conf, dom]. Given meta-walk $p = [l_1, l_2]$, we write the FD $l_1 \xrightarrow{p} l_2$ as $l_1 \to l_2$ for
brevity.
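
Definition 5.1 can be checked mechanically for a single database once the walks of meta-walk p are enumerated. The sketch below assumes each walk is given as a sequence of node identifiers; the function name `satisfies_fd` and the node names in the example are hypothetical. The definition itself quantifies over every database in L, so the check would be repeated per database.

```python
from typing import Dict, Hashable, Sequence


def satisfies_fd(walks: Sequence[Sequence[Hashable]]) -> bool:
    """True iff no two walks share a start node but differ in their end node."""
    end_of: Dict[Hashable, Hashable] = {}
    for walk in walks:
        start, end = walk[0], walk[-1]
        # setdefault records the first end node seen for this start node.
        if end_of.setdefault(start, end) != end:
            return False
    return True


# Hypothetical walks of meta-walk [paper, conf, dom].
single_domain = [
    ["paper:p", "conf:a", "dom:x"],
    ["paper:s", "conf:a", "dom:x"],
]
two_domains = single_domain + [["paper:p", "conf:c", "dom:y"]]
```

Here `single_domain` satisfies the FD from paper to dom, while `two_domains` violates it because `paper:p` reaches two different domains.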

Intuitively, an entity-rearranging transformation should preserve the labels and values of entities, the
relationships between entities, and the FDs of a database to preserve its information. For example, there is a bijection
between entities in Figure 4a and Figure 4b that preserves their labels and values. Moreover, if there is not any
FD between some entities, an entity-rearranging transformation must not rearrange them. In other words, if we
have edge (e, f) in database D and there is no FD between e and f, an entity-rearranging transformation T must
map e and f to entities T(e) and T(f) in T(D) with edge (T(e), T(f)). Similarly, if there is no edge between the
aforementioned entities in D, there must not be any edge between them in T(D). Furthermore, the transformed
database must satisfy essentially the same FDs as the original database. That is, if there is an FD between
entities of semantic types $l_1$ and $l_2$ in the original database, there must be an FD between the entities of $l_1$ and
$l_2$ in the transformed database. Otherwise, as explained in the preceding paragraph, the transformation may
introduce spurious relationships between entities. However, the corresponding FDs in the original and
transformed databases may be represented using different meta-walks. For instance, FD conf $\xrightarrow{[conf,\,paper,\,dom]}$ dom
in Figure 4a is mapped to conf $\to$ dom in Figure 4b. The following definition formalizes the aforementioned
intuitions. Let $F_L$ denote the set of FDs satisfied by the set of databases L.

Definition 5.2. A transformation $T : L \to K$ that maps database $D = (V_D, E_D, \mathcal{L}, \mathcal{A}_D)$ to database
$T(D) = (V_{T(D)}, E_{T(D)}, \mathcal{K}, \mathcal{A}_{T(D)})$ is entity rearranging iff there is a bijection $M : V_D \to V_{T(D)}$ such that

• for all $v \in V_D$, $\mathcal{L}(v) = \mathcal{K}(M(v))$ and, if v is an entity, $\mathcal{A}_D(v) = \mathcal{A}_{T(D)}(M(v))$.

• for all $u, v \in V_D$ where neither $\mathcal{L}(u) \to \mathcal{L}(v)$ nor $\mathcal{L}(v) \to \mathcal{L}(u)$ is in $F_L$, we have $(u,v) \in E_D$ iff
$(M(u), M(v)) \in E_{T(D)}$.

• there is a bijection $N : F_L \to F_K$ such that if $N(l_1 \xrightarrow{p} l_2) = l_1 \xrightarrow{p'} l_2$, for all entities $e, f \in V_D$, p(e, f, D) is
empty iff p'(M(e), M(f), T(D)) is empty.

Using Definitions 5.1 and 5.2, we have the following.

Theorem 5.3. Each entity-rearranging transformation is similarity preserving. 

Proof. Let $T : L \to K$ be an entity-rearranging transformation. For each $D = (V, E, \mathcal{L}, \mathcal{A}) \in L$, let M be
the bijection that T establishes between the set of nodes in D and the set of nodes in T(D) according to
Definition 5.2. Let $N : F_L \to F_K$ be the bijection that T establishes s.t. if $N(l_1 \xrightarrow{p} l_2) = l_1 \xrightarrow{p'} l_2$, for all
entities $e, f \in V_D$, $p(e,f,D) = \emptyset$ iff $p'(M(e),M(f),T(D)) = \emptyset$. Let $M^{-1}$ and $N^{-1}$ be the inverses of M and
N, respectively. Let us define a transformation T' from K to L as follows. For each $D' = (V', E', \mathcal{K}, \mathcal{A}') \in K$,
(1) $\forall v \in V'$, $\mathcal{K}(v) = \mathcal{L}(M^{-1}(v))$ and $\mathcal{A}'(v) = \mathcal{A}(M^{-1}(v))$, (2) $\forall u,v \in V'$, if $\mathcal{K}(u) \to \mathcal{K}(v), \mathcal{K}(v) \to \mathcal{K}(u) \notin F_K$,
then $(u,v) \in E'$ iff $(M^{-1}(u), M^{-1}(v)) \in E$, and (3) $N^{-1}$ bijectively maps $l_1 \xrightarrow{p'} l_2 \in F_K$ to $l_1 \xrightarrow{p} l_2 \in F_L$ s.t.
$\forall e, f \in V'$, $p'(e,f,D') = \emptyset$ iff $p(M^{-1}(e), M^{-1}(f), T'(D')) = \emptyset$. Next, we show that T'(T(D)) and D are the
same database.

Let $V_D$, $V_{T(D)}$, and $V_{T'(T(D))}$ denote the sets of nodes in D, T(D), and T'(T(D)), respectively. Using
Definition 5.2 and the construction of T', we construct a bijection $H : V_D \to V_{T'(T(D))}$ s.t. $H = M^{-1} \circ M$. For
each $v \in V_D$, the labels and values of v and H(v) are the same. Thus, $V_D = V_{T'(T(D))}$. Consider each edge
e = (u, v) in D. If $\mathcal{L}(u) \to \mathcal{L}(v), \mathcal{L}(v) \to \mathcal{L}(u) \notin F_L$, then $\mathcal{K}(M(u)) \to \mathcal{K}(M(v)), \mathcal{K}(M(v)) \to \mathcal{K}(M(u)) \notin F_K$,
and $\mathcal{L}(M^{-1}(M(u))) \to \mathcal{L}(M^{-1}(M(v))), \mathcal{L}(M^{-1}(M(v))) \to \mathcal{L}(M^{-1}(M(u))) \notin F_L$. Using Definition 5.2 and
the construction of T', we have that (u, v) exists in D iff $(H(u), H(v)) = (M^{-1}(M(u)), M^{-1}(M(v)))$ exists in
T'(T(D)). Otherwise, without loss of generality, assume $\mathcal{L}(u) \to \mathcal{L}(v) \in F_L$. Let $p = [\mathcal{L}(u), \mathcal{L}(v)]$. N bijectively
maps $\mathcal{L}(u) \xrightarrow{p} \mathcal{L}(v) \in F_L$ to $\mathcal{L}(u) \xrightarrow{p'} \mathcal{L}(v) \in F_K$ for some meta-walk p'. Also, $N^{-1}$ bijectively maps
$\mathcal{L}(u) \xrightarrow{p'} \mathcal{L}(v) \in F_K$ to $\mathcal{L}(u) \xrightarrow{p} \mathcal{L}(v) \in F_L$. N guarantees that, $\forall u,v \in V_D$, $p(u,v,D) = \emptyset$ iff $p'(M(u), M(v), T(D)) = \emptyset$.
Also, $N^{-1}$ guarantees that, $\forall u',v' \in V_{T(D)}$, $p'(u',v',T(D)) = \emptyset$ iff $p(M^{-1}(u'), M^{-1}(v'), T'(T(D))) = \emptyset$. That
is, $\forall u,v \in V_D$, $p(u,v,D) = \emptyset$ iff $p(H(u), H(v), T'(T(D))) = \emptyset$. Because we assume that our data graph is
simple, if p(u, v, D) is not empty, then there is exactly one walk [u, v] in p(u, v, D), which is a walk along an
edge (u, v). Similarly, if p(H(u), H(v), T'(T(D))) is not empty, then an edge (H(u), H(v)) exists in T'(T(D)).
That is, (u, v) exists in D iff (H(u), H(v)) exists in T'(T(D)). Thus, D and T'(T(D)) are the same, and so T
is invertible.

Using the first condition in Definition 5.2, each entity-rearranging transformation is entity preserving.
Therefore, every entity-rearranging transformation is similarity preserving. □

Entity-rearranging transformations resemble (de)normalization in relational and tree-shaped XML databases.
They, however, modify the connections between entities in the database instead of removing duplicates
and are defined over graph databases that follow less restrictive schemas than relational schemas or DTDs.

5.2 Extension of R-PathSim 


Because entity-rearranging transformations modify the topology of the database, RWR and SimRank are not
robust under these transformations. For example, consider the entity-rearranging transformation between
Figure 4a and Figure 4b. RWR and SimRank find paper:p to be more similar to paper:s than paper:t in Figure 4b.
However, they find paper:p to be more similar to paper:t than paper:s in Figure 4a. R-PathSim and PathSim are
also not robust under entity-rearranging transformations. A user may like to find conferences similar to conf:h
based on their common keywords using meta-walk $p_1$ = [conf, dom, kw, dom, conf] in Figure 4b. R-PathSim
finds conf:a and conf:c equally similar to conf:h. The meta-walk that represents the closest relationship to $p_1$
in Figure 4a is $p_2$ = [conf, paper, dom, kw, dom, paper, conf]. However, using meta-walk $p_2$, R-PathSim finds
conf:a more similar to conf:h than conf:c in Figure 4a.

We observe that meta-walk $p_2$ does not exactly represent the same information as meta-walk $p_1$ because $p_2$
contains additional entity labels, i.e., paper. Hence, a walk in $p_1$ may correspond to several walks in $p_2$. This
causes R-PathSim to produce different rankings for the same query over Figures 4a and 4b. To return the same
answers over Figures 4a and 4b, one may look for a structure in Figure 4a that represents exactly the same
information that $p_1$ expresses in Figure 4b. Every walk of $p_1$ represents the fact that a conference belongs to a
certain domain and does not contain any information about the number of papers published in the conference.
Hence, we extend the definition of meta-walk to ignore the number of occurrences of certain entities in a walk.
For example, we define meta-walk $p_3$ in Figure 4a whose walks express the fact that entities of labels conf and
dom are connected through paper entities without any regard to the number of papers between them. This
meta-walk treats all walks between each pair of entities of label conf and dom through paper entities as a single
walk. We show $p_3$ as [conf, *paper, dom, kw, dom, *paper, conf]. We call paper a *-label in $p_3$. Using a *-label in
the meta-walk indicates that the user is interested only in whether a connection between entities in the meta-walk
exists. Meta-walk $p_3$ has the same number of walks in Figure 4a as $p_1$ has in Figure 4b. Hence, R-PathSim will
deliver the same ranking for query conf:h over Figure 4a using meta-walk $p_3$ and Figure 4b using meta-walk $p_1$.
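
The effect of a *-label on walk counts can be sketched as follows. The sketch assumes that walks are tuples of node identifiers and that *-labelled labels are given by their positions in the meta-walk; the function name `count_star_walks` and the example nodes are hypothetical. Walks that differ only at *-labelled positions collapse into a single walk.

```python
from typing import Hashable, Iterable, Sequence, Set


def count_star_walks(walks: Iterable[Sequence[Hashable]], star_positions: Set[int]) -> int:
    """Count walks after erasing the nodes at *-labelled positions."""
    collapsed = {
        tuple(node for pos, node in enumerate(walk) if pos not in star_positions)
        for walk in walks
    }
    return len(collapsed)


# Hypothetical walks of [conf, paper, dom]: two papers connect the same conference and domain.
walks = [
    ("conf:a", "paper:p", "dom:x"),
    ("conf:a", "paper:s", "dom:x"),
]
```

Without a *-label the two walks are counted separately; with paper (position 1) as a *-label they count once, which matches the single conf-to-dom connection in the alternative representation.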

Furthermore, we may have to use a more complex meta-walk in a database to express the same information
as a simpler meta-walk in the entity-rearranging transformation of the database. For instance, a user may like
to find similar conferences based on the meta-walk $p_2$ in Figure 4a. However, she must use a more complex
meta-walk [conf, paper, conf, dom, kw, dom, conf, paper, conf] to obtain the same results in Figure 4b. Instead of
stopping at the candidate answer label, this meta-walk goes beyond and traverses back to the candidate answer
label. We call this type of meta-walks meta-walks with repeated entities.

Hence, a relationship between the same set of entities may be represented by normal meta-walks, as defined
earlier, in a database, but using meta-walks with *-labels or repeated entities on its entity-rearranging
transformations. To be robust over entity-rearranging transformations, R-PathSim should consider meta-walks
with *-labels and repeated entities in addition to the meta-walks defined earlier. Thus, we extend R-PathSim
to consider these types of meta-walks. Nevertheless, if we allow a *-label for every label in each meta-walk,
R-PathSim has to precompute the commuting matrices for a large number of meta-walks. Hence, we would like to
identify a minimal set of meta-walks with *-label(s) that capture all relationships in a database and that R-PathSim
can use to deliver the same results over the database and its entity rearrangements. First, according to
Definition 5.2, if neither meta-walk p nor any of its subwalks is a meta-walk for any FD over a database D, there
is a meta-walk r over the entity-rearranging transformation of D, T(D), such that p and r have equal numbers
of walks in D and T(D), respectively. Thus, R-PathSim scores of entities over these meta-walks are equal over
D and T(D). Hence, we assign *-labels only on a meta-walk that determines some FD in a database. Second,
consider meta-walk $s = [l_1, \ldots, l_k]$ in database D where $l_1 \to l_2$, $l_2 \to l_3$, ..., $l_{k-1} \to l_k$ hold in D. If there is a
walk from entity e of label $l_1$ to entity f of label $l_k$ in D, there will be exactly one walk between e and f in
D because of the FDs over s. Hence, every meta-walk created by setting some of the labels in s to *-labels has
the same number of walks in D as s has. Intuitively, setting some labels in s to *-labels will not express any
new useful relationship between entities in s. Moreover, because R-PathSim returns the same similarity score
for two entities using s and its modifications, we will consider only s.

Next, we prove that the aforementioned extension of R-PathSim is representation independent over
entity-rearranging transformations. Let $\mathbf{L}$ be a set of databases whose set of labels is $L$. Let $S \subseteq L$ be a set of
labels in a database. We define a binary relation $\prec_S$ between labels $l_1$ and $l_2$ in $L$ where $l_1 \prec_S l_2$ iff there is
a meta-walk p whose labels exist in S such that $l_1 \xrightarrow{p} l_2 \in F_{\mathbf{L}}$. We define S to be a chain iff $\prec_S$ is a total
order over S. S is a maximal chain iff there is no $R \subseteq L$ such that $S \subset R$ and R is a chain. For instance,
because we have paper $\to$ conf, paper $\to$ dom, and conf $\xrightarrow{[conf,\,paper,\,dom]}$ dom in Figure 4a, {paper, conf, dom} is
a maximal chain. By abuse of notation, we let $F_S$ denote the set of FDs in $\mathbf{L}$ whose labels are in S. In
this paper, we focus on sets of databases whose sets of maximal chains are mutually exclusive. MAS databases
whose fragments are shown in Figures 4a and 4b are examples of such databases.


Theorem 5.4. Given an entity-rearranging transformation $T : L \to K$, for all $D \in L$, T bijectively maps
each meta-walk p in D to a meta-walk r in T(D) such that for all entities e and f in D,
$|p(e,f,D)| = |r(T(e),T(f),T(D))|$.


Proof. Let us define internal labels of a meta-walk $p = [l_1, \ldots, l_n]$ as labels $l_2, \ldots, l_{n-1}$. Given a meta-walk p in D,
one can write p as a concatenation of meta-walks $s_1 \ldots s_m$ where m is the smallest value s.t. each $s_i$, $i = 1 \ldots m$,
satisfies exactly one of the following conditions: (1) $s_i$ is not a meta-walk of any FD in L, (2) all internal labels
of $s_i$ are *-labels, or (3) $s_i = [l'_1, \ldots, l'_k]$ where $l'_1 \to l'_2, \ldots, l'_{k-1} \to l'_k \in F_L$ (or $l'_k \to l'_{k-1}, \ldots, l'_2 \to l'_1 \in F_L$). Clearly,
no $s_i$ satisfies both conditions (1) and (3), or both (2) and (3). Because we set the labels of only meta-walks used
in an FD to *-labels, no $s_i$ satisfies both conditions (1) and (2). Suppose there is more than one concatenation
of p that satisfies the aforementioned conditions. Without loss of generality, assume $p = s_1 s_2 = s'_1 s'_2$ where
$s_1 = [l_1, \ldots, l_k]$ and $s_2 = [l_k, \ldots, l_n]$. If $s_1$ satisfies condition (1), then $s_2$ satisfies either condition (2) or (3). Otherwise,
m = 2 is not the smallest number that satisfies the aforementioned concatenation for p. Clearly, $s'_1$ cannot
satisfy condition (2) or (3). Also, $[l_1, \ldots, l_k, \ldots, l_{k'}]$, $k < k' < n$, cannot satisfy condition (1). Hence, $s_1 = s'_1$
and $s_2 = s'_2$. If $s_1$ satisfies condition (2), then any contiguous proper subwalk of $s_1$ and a walk $[l_1, \ldots, l_k, \ldots, l_{k'}]$,
$k < k' < n$, cannot satisfy any of the aforementioned conditions. Hence, $s'_1 = s_1$ and $s'_2 = s_2$. If $s_1$ satisfies
condition (3) and $s_2$ satisfies either condition (1) or (2), then any contiguous proper subwalk of $s_2$ and a walk
$[l_{k'}, \ldots, l_k, \ldots, l_n]$, $1 < k' < k$, cannot satisfy any of the aforementioned conditions. Thus, $s'_2 = s_2$, and so
$s'_1 = s_1$. If $s_1$ and $s_2$ satisfy condition (3), then either (a) $l_1 \to l_2, \ldots, l_{k-1} \to l_k$ and $l_n \to l_{n-1}, \ldots, l_{k+1} \to l_k$, or
(b) $l_k \to l_{k-1}, \ldots, l_2 \to l_1$ and $l_k \to l_{k+1}, \ldots, l_{n-1} \to l_n$. Thus, any contiguous proper subwalk of $s_1$ and a walk
$[l_1, \ldots, l_k, \ldots, l_{k'}]$, $k < k' < n$, cannot satisfy any of the aforementioned conditions. Hence, $s'_1 = s_1$ and $s'_2 = s_2$.
Therefore, there is exactly one concatenation of p that satisfies the aforementioned conditions.

Let $s_i = [l'_1, \ldots, l'_k]$. Suppose $s_i$ satisfies condition (1). Using Definition 5.2, there is a bijection between walks of $s_i$ and walks of $T(s_i)$ s.t. $|s_i(e, f, D)| = |T(s_i)(T(e), T(f), T(D))|$. Suppose $s_i$ satisfies condition (2). Because we set the labels of only meta-walks used in an FD to *-labels, we have $l'_1 \rightarrow l'_k$ (or $l'_1 \leftarrow l'_k$). By Definition 5.2, $T$ bijectively maps $l'_1 \rightarrow l'_k$ to some $l'_1 \xrightarrow{r_i} l'_k$ in $F_K$ s.t. $s_i(e, f, D) = \emptyset$ iff $r_i(T(e), T(f), T(D)) = \emptyset$. If $s_i(e, f, D) \neq \emptyset$, then $|s_i(e, f, D)| = 1$ and $r_i(T(e), T(f), T(D)) \neq \emptyset$. Let $r^*_i$ be obtained by changing all internal labels of $r_i$ to *-labels. If $r_i = [l'_1, l''_1, \ldots, l''_{k'}, l'_k]$ and $l'_1 \rightarrow l''_1, \ldots, l''_{k'} \rightarrow l'_k$, then $r^*_i = r_i$. If $r_i(T(e), T(f), T(D)) \neq \emptyset$, then $|r^*_i(T(e), T(f), T(D))| = 1$. Hence, $|s_i(e, f, D)| = |r^*_i(T(e), T(f), T(D))|$. Suppose $s_i$ satisfies condition (3). If $s_i(e, f, D) \neq \emptyset$, then $|s_i(e, f, D)| = 1$. Similar to the case where $s_i$ satisfies condition (2), we prove that $T$ bijectively maps $s_i$ to $r^*_i$ s.t. $|s_i(e, f, D)| = |r^*_i(T(e), T(f), T(D))|$.

The end node of each walk in $s_i$ is the start node of a walk of $s_{i+1}$. Hence, by Definition 5.2, the end node of each walk of $T(s_i)$ is the start node of a walk of $T(s_{i+1})$. Let $r$ be the meta-walk created by concatenating the $T(s_i)$'s. Each walk of $r$ is a concatenation of walks of the $T(s_i)$ in $T(D)$. □


Similar to Section 4.2, one may compute a single similarity between a pair of entities by averaging the R-PathSim scores over all meta-walks between the pair. Theorem 5.4 guarantees that the aggregated similarity score of each pair of entities equals that of their images under an entity-rearranging transformation. We use the same methods discussed in Section 4.2 to precompute and compute the scores of meta-walks with *-labels and repeated entities. Our results here introduce a new method to make a similarity search algorithm representation independent. Because the same relationship may be expressed in several forms over different representations of the same data, the algorithm should consider more varieties of relationships.

Theorem 5.4 guarantees that aggregated R-PathSim computed over all meta-walks, including ones that use *-labels or repeated nodes, returns the same ranked list of answers as aggregated R-PathSim computed over the corresponding meta-walks in the database transformed by entity rearranging. Since the set of all meta-walks in a database is infinite, it is impractical to compute R-PathSim over all of them. Limiting the size of the meta-walks, as in Section 4.2, reduces the number of meta-walks to compute over; however, this solution does not guarantee that the results over the database and its entity-rearranging transformation are the same. Suppose the database in Figure 4a is a fragment of a database $D$ that contains labels that do not exist in Figure 4a and that are not part of any functional dependency. Let $D$ be transformed to a database $E$ by the entity-rearranging transformation that maps Figure 4a to Figure 4b. Consider that the mapping of the meta-walk $p_1 = [\mathit{conf}, \mathit{paper}, \mathit{dom}, \mathit{paper}, \mathit{conf}]$ in $D$ is $p_2 = [\mathit{conf}, \mathit{paper}, \mathit{conf}, \mathit{dom}, \mathit{conf}, \mathit{paper}, \mathit{conf}]$ in $E$. Assume the size limit of meta-walks is a large number $N$. There exists a meta-walk $p = r p_1 s$ of size $N$ in $D$ where $r$ and $s$ do not contain any edge that is part of any FD in $D$. The mapping that follows Theorem 5.4 in $E$ is $p' = r p_2 s$, whose size is larger than that of $p$. Following this example, we can argue that no matter how large the size limit of meta-walks is, there is an entity-rearranging transformation and a meta-walk whose mapped meta-walk under Theorem 5.4 has a size that exceeds the given limit, or vice versa.
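The size argument above can be checked on a toy instance. The following is a minimal sketch, not part of the original method: the helper `concat` and the label `org` (standing in for a label that is not involved in any FD) are assumptions made only for illustration, while `p1` and `p2` are the meta-walks from the example.

```python
def concat(*walks):
    """Concatenate meta-walks; consecutive walks must share their boundary label."""
    result = list(walks[0])
    for walk in walks[1:]:
        if result[-1] != walk[0]:
            raise ValueError("meta-walks must share the boundary label")
        result.extend(walk[1:])  # do not repeat the shared label
    return result

# Meta-walks from the example in the text.
p1 = ["conf", "paper", "dom", "paper", "conf"]                   # in D
p2 = ["conf", "paper", "conf", "dom", "conf", "paper", "conf"]   # its mapping in E

# Hypothetical prefix and suffix that use no FD edge ("org" is an assumed label).
r = ["org", "conf"]
s = ["conf", "org"]

p = concat(r, p1, s)          # meta-walk in D
p_mapped = concat(r, p2, s)   # corresponding meta-walk in E

size_limit = len(p)           # choose the limit N so that p just fits
print(len(p), len(p_mapped), len(p_mapped) > size_limit)
```

Whatever limit is chosen, padding `r` and `s` so that `p` exactly reaches it leaves the mapped meta-walk two labels over the limit in this instance.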

For aggregated R-PathSim to be representation independent while being accurate and efficient, we should compute aggregated R-PathSim only over a subset of the meta-walks in a database. Given a set $E$ of entity labels in a database with a set $F$ of FDs, we propose Algorithm 2, which finds a subset of meta-walks whose labels exist in $E$ or appear in some chain in $F$.

Algorithm 2 first finds a subset $S$ of meta-walks without any repeated labels that contain only entity labels from $E$ or from some chain in the database. It then adds more meta-walks to $S$ by modifying each meta-walk in $S$ whose edges are used in an FD in some chain. While keeping intact the parts of each meta-walk whose edges do not appear in any FD, the algorithm modifies the other parts either by extending them to reach the label that determines the other labels in the chain, or otherwise by marking the labels in these parts as *-labels. Finally, each meta-walk in $S$ is concatenated with its own reverse so that the meta-walk starts from the query label and ends at the same query label. Suppose the maximum number of distinct labels that are adjacent to a node is $d$. Let $L_F$ denote the set of labels of FDs in $F$. Since $S$ is constructed such that no label is repeated in any meta-walk, $S$ contains at most $O(d!)$ meta-walks before the modification. In each meta-walk, there can be at most $O(|L_F|)$ contiguous subwalks that involve an FD in some chain. Thus, there are $O(2^{|L_F|})$ possible modifications for each meta-walk. Therefore, the size of the returned subset of meta-walks is $O(d!\,2^{|L_F|})$. By Proposition 5.6, the R-PathSim score computed over the set obtained by Algorithm 2 is the same as the R-PathSim score computed over the set of meta-walks obtained by this algorithm over the transformed database.
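The combination step described above (keep FD-free parts as they are, substitute every alternative for an FD part, then append the reverse) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `(walk, uses_fd)` encoding of parts, the `"*"` marker for a *-label, and the toy alternatives are all hypothetical.

```python
from itertools import product

def concat(a, b):
    """Concatenate two meta-walks that share their boundary label."""
    if a[-1] != b[0]:
        raise ValueError("meta-walks must share the boundary label")
    return a + b[1:]

def expand_meta_walk(parts, chain_alternatives):
    """Build the meta-walks derived from one partitioned meta-walk.

    parts: ordered list of (walk, uses_fd) pairs covering the meta-walk.
    chain_alternatives: maps (first_label, last_label) of an FD part to the
        list of replacement meta-walks for that part (assumed precomputed).
    """
    options = []
    for walk, uses_fd in parts:
        if uses_fd:
            # An FD part is replaced by every alternative from its chain.
            options.append(chain_alternatives[(walk[0], walk[-1])])
        else:
            # A part that uses no FD edge is kept as is.
            options.append([walk])
    results = []
    for choice in product(*options):
        walk = list(choice[0])
        for part in choice[1:]:
            walk = concat(walk, list(part))
        # Append the reverse so the meta-walk ends at the query label.
        results.append(concat(walk, walk[::-1]))
    return results

# Toy input: query label "author", one FD-free part and one FD part.
parts = [(["author", "paper"], False), (["paper", "conf", "dom"], True)]
alternatives = {("paper", "dom"): [["paper", "conf", "dom"], ["paper", "*", "dom"]]}
meta_walks = expand_meta_walk(parts, alternatives)
for m in meta_walks:
    print(m)
```

Because every FD part contributes an independent choice among its alternatives, the number of generated meta-walks is the product of the alternative counts, which is where the exponential factor in the size bound comes from.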


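Lemma 5.5. Let $L$ be a chain in a database. Let $l_1 \xrightarrow{p} l_n \in F_L$ where $p = [l_1, \ldots, l_i, \ldots, l_n] \subseteq L$. For all $i = 2 \ldots n-1$, there are no $j \in \{1, \ldots, i-1\}$ and $k \in \{i+1, \ldots, n\}$ such that $l_j \rightarrow l_i,\ l_k \rightarrow l_i \in F_L$.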


Proof. Consider a database $D \in L$. Suppose there exists some $i \in \{2, \ldots, n-1\}$ for which there are $j \in \{1, \ldots, i-1\}$ and $k \in \{i+1, \ldots, n\}$ s.t. $l_j \rightarrow l_i,\ l_k \rightarrow l_i \in F_L$. Let $p_1 = [l_j, \ldots, l_i]$ and $p_2 = [l_k, \ldots, l_i]$. Then there are entities $e$, $f_1$, $f_2 \neq f_1$ and $g$ whose labels are $l_1$, $l_n$, $l_n$ and $l_i$, respectively, s.t. $p_1(e, g, D) \neq \emptyset$, $p_2(f_1, g, D) \neq \emptyset$ and $p_2(f_2, g, D) \neq \emptyset$. Since $p = p_1 p_2$, $p(e, f_1, D) \neq \emptyset$ and $p(e, f_2, D) \neq \emptyset$. That is, $l_1 \xrightarrow{p} l_n$ does not hold in $F_L$, which is a contradiction. □


Proposition 5.6. Let $T : L \rightarrow K$ be an entity-rearranging transformation. Given a query $q$ in a database $D \in L$, let $S_D$ and $S_{T(D)}$ be the sets of meta-walks returned by Algorithm 2 using the same inputs over $D$ and $T(D)$, respectively. The aggregated R-PathSim score over $S_D$ for each candidate answer in $D$ and the aggregated R-PathSim score over $S_{T(D)}$ for the same candidate answer in $T(D)$ are the same.

Proof. Consider each meta-walk $m \in S_D$. We have that $m = p p^{-1}$ for some meta-walk $p$ that starts with label $l$ in $D$. Similarly, for each meta-walk $m' \in S_{T(D)}$, $m' = p' p'^{-1}$ for some meta-walk $p'$ that starts with label $l$ in $T(D)$. By the construction of $p$ by Algorithm 2, we can write $p$ as $p = p_1 \ldots p_k$ where $k$ is the smallest s.t. each $p_i$, $i = 1 \ldots k$, follows either (1) for every edge $(u, v)$ of $p_i$, $u \rightarrow v,\ v \rightarrow u \notin F_D$, or (2) there exists a maximal chain $C$ in $T(D)$ s.t. for every edge $(u, v)$ of $p_i$, $u \rightarrow v$ or $v \rightarrow u$ exists in $C$. Without loss of generality, we will show that there is a bijective mapping $M$ between $p_i$ in $D$ and $p'_i$ in $T(D)$ s.t. for every pair of entities $e$ and $f$, $|p_i(e, f, D)| = |p'_i(e, f, T(D))|$. By Definition 5.2, each $p_i$ that follows condition (1) exists in both $D$ and $T(D)$. Further, each walk of $p_i$ exists in both $D$ and $T(D)$. Thus, for every pair of entities $e$ and $f$, $|p_i(e, f, D)| = |p_i(e, f, T(D))|$. For case (2), assume $p_i = [l_1, \ldots, l_n]$ where $l_1, \ldots, l_n$ belong to some maximal chain $C$ in $D$. Let $l_0$ be the smallest label in $C$ under $\prec_C$. We prove the claim for $p_i$ in each of the following cases. (Case 1) Suppose $l_1 = l_0$ (or $l_n = l_0$). By Definition 5.2 and because we assume that the sets of maximal chains in a database are mutually exclusive, there exists exactly one meta-walk $p'_i$ s.t. every label of $p'_i$ exists in $C$. By Lemma 5.5, we have that $l_1 \rightarrow l_2, \ldots, l_{n-1} \rightarrow l_n$ in $D$. For every pair of entities $e$ and $f$, $|p_i(e, f, D)| = 1$. Because $p'_i$ must also start and end with the same labels as those of $p_i$, using similar arguments, $|p'_i(e, f, T(D))| = |p_i(e, f, D)|$. (Case 2) Suppose $l_i \neq l_0$ for any $i = 1 \ldots n$, or $l_2, \ldots, l_{n-1}$ are *-labels. By Lemma 5.5, we have that $l_1 \rightarrow l_2, \ldots, l_{n-1} \rightarrow l_n$ (or $l_1 \leftarrow l_2, \ldots, l_{n-1} \leftarrow l_n$) in $D$. Hence, for every pair of entities $e$ and $f$, $|p_i(e, f, D)| = 1$. Definition 5.2 bijectively maps $l_1 \xrightarrow{p_i} l_n$ in $D$ to $l_1 \xrightarrow{p'_i} l_n$ where every label of $p'_i$ belongs to $C$. If $l_0$ does not appear in $p'_i$, then using a similar argument, $|p'_i(e, f, T(D))| = 1$ for every pair of entities $e$ and $f$. Otherwise, the algorithm marks all labels except the first and the last in $p'_i$ as *-labels, so $|p'_i(e, f, T(D))| = 1$. Therefore, $|p'_i(e, f, T(D))| = |p_i(e, f, D)|$. (Case 3) Suppose $l_j = l_0$ for some $j = 1 \ldots n$. By Lemma 5.5, we have that $l_1 \leftarrow l_2, \ldots, l_{j-1} \leftarrow l_j$, $l_j \rightarrow l_{j+1}, \ldots, l_{n-1} \rightarrow l_n$ in $D$. By the definition of FD, $|p_i(e, f, D)|$ equals the number of entities $g$ of label $l_0$ that exist in walks of $p_i$ from $e$ to $f$ in $D$. That is, $|p_i(e, f, D)| = \sum_g |[l_1, \ldots, l_j](e, g, D)| =$
Algorithm 2: MetaWalkFinders

Input: Database $D$, entity label $l$, set of entity labels $L$, integer $R$
Output: Subset $S$ of meta-walks in $D$ that start and end with label $l$

$S \leftarrow \{\}$
$C \leftarrow$ set of maximal chains in $D$
$L_C \leftarrow$ set of entity labels that exist in $C$
$L_{all} \leftarrow L \cup L_C$
$P \leftarrow$ set of meta-walks in $D$ that start with label $l$, contain only entity labels from $L_{all}$, and in which each entity label occurs at most $R$ times
$M \leftarrow \emptyset$
foreach $l_1, l_2 \in L_C$ do
    /* By the construction of $P$, $l_1 \neq l_2$ */
    $M[l_1][l_2] \leftarrow$ MetaWalkFindersFromChain($F$, $l_1$, $l_2$)
end
foreach $p'' \in P$ do
    $S' \leftarrow \{[l]\}$
    Construct an ordered list $Parts = (p_1, \ldots, p_k)$ s.t. $p'' = p_1 \ldots p_k$ where $k$ is the smallest such that each $p_i$, $i = 1 \ldots k$, is either a meta-walk whose edges are not used in any FD or a meta-walk whose labels exist in a single FD in $F$.
    foreach $p' \in Parts$ do
        if for every edge $(u, v)$ in $p'$, $u \rightarrow v$ does not appear in any chain in $F$ then
            /* Keep the partition $p'$ of $p''$ whose edges are not used in any FD as is. */
            foreach $p \in S'$ do
                Remove $p$ from $S'$
                Add $p p'$ to $S'$
            end
        else
            /* Replace the partition of $p''$ whose edges are used in some FD by using its maximal chain. */
            $l_1 \leftarrow$ first label of $p'$
            $l_2 \leftarrow$ last label of $p'$
            foreach $p \in S'$ do
                Remove $p$ from $S'$
                foreach $p''' \in M[l_1][l_2]$ do
                    Add $p p'''$ to $S'$
                end
            end
        end
    end
    $S \leftarrow S \cup S'$
end
/* Concatenate each meta-walk with its reverse so that it visits label $l$ at the end. */
foreach $p \in S$ do
    Replace $p$ with $p p^{-1}$
end
return $S$


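$\sum_g |[l_j, \ldots, l_n](g, f, D)|$. Because $l_1, l_n \in C$, Definition 5.2 bijectively maps the FD between $l_1$ and $l_n$ in $D$ to $l_1 \xrightarrow{r} l_n$ in $T(D)$ for some meta-walk $r = [l'_1, \ldots, l'_{n'}]$ whose labels are in $C$, with $l'_1 = l_1$ and $l'_{n'} = l_n$. If $l'_{j'} = l_0$ for some $j' = 1 \ldots n'$, by using Lemma 5.5, we have that $l'_1 \leftarrow l'_2, \ldots, l'_{j'-1} \leftarrow l'_{j'}$, $l'_{j'} \rightarrow l'_{j'+1}, \ldots, l'_{n'-1} \rightarrow l'_{n'}$ in $D$. That is, $|r(e, f, T(D))| = \sum_g |[l'_1, \ldots, l'_{j'}](e, g, T(D))| = \sum_g |[l'_{j'}, \ldots, l'_{n'}](g, f, T(D))|$. Using the definition of FD and Definition 5.2, $\sum_g |[l'_1, \ldots, l'_{j'}](e, g, T(D))| = \sum_g |[l_1, \ldots, l_j](e, g, D)|$. Letting $p'_i = r$, we have $|p'_i(e, f, T(D))| = |p_i(e, f, D)|$. Otherwise, there is no $j' = 1 \ldots n'$ s.t. $l'_{j'} = l_0$. By Lemma 5.5, we have that $l'_1 \rightarrow l'_2, \ldots, l'_{n'-1} \rightarrow l'_{n'}$ (or $l'_1 \leftarrow l'_2, \ldots, l'_{n'-1} \leftarrow l'_{n'}$). Because sets of maximal chains are mutually exclusive, there exists exactly one meta-walk $s = [l''_1, \ldots, l''_{n''}]$ whose labels are in $C \setminus \{l'_1, \ldots, l'_{n'}\}$ s.t. $l_0 \xrightarrow{s} l'_1$ where $l''_1 = l_0$ and $l''_{n''} = l_1$. Further, $l''_1 \rightarrow l''_2, \ldots, l''_{n''-1} \rightarrow l''_{n''}$ (or $l''_1 \leftarrow l''_2, \ldots, l''_{n''-1} \leftarrow l''_{n''}$). The algorithm constructs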
Algorithm 3: MetaWalkFindersFromChain

Input: Set of maximal chains $F$, labels $L_1$, $L_2$, $L_1 \neq L_2$
Output: Set $M$ of meta-walks from $L_1$ to $L_2$ whose labels are from the maximal chain

$M \leftarrow \emptyset$
Find $f = \{l_1 \rightarrow l_2, l_2 \rightarrow l_3, \ldots, l_{n-1} \rightarrow l_n\} \in F$ s.t. $L_1 = l_j$ and $L_2 = l_k$, for some $j, k = 1 \ldots n$
/* There is at most one such $f$ because sets of maximal chains in a database are mutually exclusive. */
if $f$ exists then
    if $j > k$ then
        Swap labels between $l_j$ and $l_k$
        $swap \leftarrow$ true
    end
    Find $l_j \xrightarrow{s_1} l_k \in F$
    Add $s_1$ to $M$
    if $l_1$ appears in $s_1$ then
        $s'_1 \leftarrow$ copy of $s_1$
        Mark any valid internal label of $s'_1$ as *-label
        Add $s'_1$ to $M$    /* case 1: *-label */
    else
        Find $l_1 \xrightarrow{s_2} l_j \in F$
        Add $s_2^{-1} s_2 s_1$ to $M$    /* case 2: extends $s_2$ to reach $l_1$ */
    end
    if $swap$ then
        foreach $p \in M$ do
            Replace $p$ with $p^{-1}$
        end
    end
end
return $M$


a meta-walk $p'_i = s^{-1} s r$ in $T(D)$. By the definition of FD, for each pair of entities $e$ and $f$, $|r(e, f, T(D))| = 1$ if a walk of $r$ exists between $e$ and $f$. Also, $|s^{-1} s(e, e, T(D))| = \sum_g |s^{-1}(e, g, T(D))| = \sum_g |s(g, e, T(D))|$. Thus, $|p'_i(e, f, T(D))| = |s^{-1} s(e, e, T(D))|\,|r(e, f, T(D))| = \left(\sum_g |s(g, e, T(D))|\right) |r(e, f, T(D))| = \sum_g |s(g, e, T(D))| = \sum_g |[l_1, \ldots, l_j](e, g, D)| = |p_i(e, f, D)|$. Hence, the bijectivity of $M$ holds with the desired properties. Therefore, the proposition holds. □

6 Empirical Evaluation 

6.1 Experiment Settings 

We use five datasets in our experiments. We use a subset of DBLP data with 1,227,602 nodes and 2,692,679 edges, which contains information about publications in computer science. We add information about the area of each conference in DBLP from Microsoft Academic Search. Figure 6a shows fragments of DBLP. We also use a subset of Microsoft Academic Search data with 44,044 nodes and 44,196 edges whose fragments are shown in Figure 4a. We use the Arxiv High Energy Physics paper citation graph from SNAP with 34,536 nodes and 42,158 edges whose fragments are shown in Figure 3b. We use a subset of IMDb data with 2,409,252 nodes and 7,525,281 edges whose fragments are shown in Figure 5a. We also use the WSU course database from cs.washington.edu/research/xmldatasets with 1,124 nodes and 1,959 edges, which contains information about courses, instructors, and course offerings. Figure 7a shows fragments of this dataset. We implement our algorithm and the other algorithms using MATLAB 8.5 on a Linux server with 64GB of memory and two quad-core processors.

6.2 Representation Independence 

We use normalized Kendall’s tau to compare ranked lists. The value of normalized Kendall’s tau varies between 
0 and 1 where 0 means the two lists are identical and 1 means one list is the reverse of the other. As users are 
interested in the highly ranked answers, we compare top 3, 5 and 10 answers. 
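As a concrete reading of this measure, the following minimal sketch computes the normalized Kendall's tau distance as the fraction of item pairs that the two lists order differently. It assumes both lists rank exactly the same items without ties; how the experiments handle top-k lists that contain different items is not stated here, so that case is simply rejected. The function name is illustrative.

```python
from itertools import combinations

def normalized_kendall_tau(list_a, list_b):
    """Fraction of item pairs ordered differently by the two ranked lists.

    Returns 0.0 for identical lists and 1.0 when one list is the reverse
    of the other. Assumes both lists contain the same distinct items.
    """
    if len(set(list_a)) != len(list_a) or set(list_a) != set(list_b) \
            or len(list_a) != len(list_b):
        raise ValueError("lists must rank the same distinct items")
    n = len(list_a)
    if n < 2:
        return 0.0
    rank_b = {item: position for position, item in enumerate(list_b)}
    # combinations() yields pairs (x, y) with x ranked before y in list_a;
    # the pair is discordant when list_b ranks y before x.
    discordant = sum(1 for x, y in combinations(list_a, 2)
                     if rank_b[x] > rank_b[y])
    return discordant / (n * (n - 1) / 2)

same = normalized_kendall_tau(["a", "b", "c"], ["a", "b", "c"])
reversed_lists = normalized_kendall_tau(["a", "b", "c"], ["c", "b", "a"])
one_swap = normalized_kendall_tau(["a", "b", "c"], ["a", "c", "b"])
print(same, reversed_lists, one_swap)
```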

Relationship Reorganization: Because it takes too long to run SimRank and RWR over the full IMDb dataset, we evaluate their robustness on the largest subset of IMDb over which they run reasonably fast, which has 47,835 nodes and 130,916 edges. We set the restart probability of RWR and the damping














                   IM2MV   IM2AS   IM2FB   DB2SI   WS2AL
Top 3    RWR       0.473   0.505   0.170   0.482   0.300
         SimRank   0.411   0.458   0.333   0.481   0.440
         PathSim   -       -       -       0.641   0.320
Top 5    RWR       0.444   0.459   0.158   0.447   0.259
         SimRank   0.365   0.392   0.337   0.455   0.387
         PathSim   -       -       -       0.608   0.310
Top 10   RWR       0.404   0.415   0.155   0.412   0.253
         SimRank   0.343   0.348   0.322   0.410   0.341
         PathSim   -       -       -       0.590   0.247

Table 1: Average ranking differences for all transformations.


factor of SimRank to 0.8. We reorganize the IMDb database to the structures of Freebase (FB), Movielicious (MVL) and a structure from evc-cit.info/cit041x/assignment_css.html (ASM), whose fragments are shown in Figure 5b, Figure 5c and Figure 5d, respectively. We denote the transformation from IMDb to Freebase as IM2FB, from IMDb to Movielicious as IM2MV, and from IMDb to ASM as IM2AS. Since the MVL and ASM structures do not have any character nodes, we remove character nodes in IMDb when applying the IM2MV and IM2AS transformations. For the query workload, we randomly sample 50 movies in the IMDb database based on their degrees.

Table 1 shows the average ranking differences for the top 3, 5, and 10 answers returned by RWR and SimRank


[Figure 5: Fragments of movie databases: (a) IMDb, (b) FB, (c) MVL, (d) ASGN. A, M, C, D, S, As, Cr and D-by denote actor, movie, character, director, starring, actors, credit and directed-by.]

over the IM2MV, IM2AS, and IM2FB transformations. Because R-PathSim delivers the same rankings over these transformations, we omit its results. Because each entity label and its consecutive entity labels in every meta-walk over the FB, MVL, and ASM data are different, all walks used in the computation of PathSim are informative; thus, PathSim is robust over these transformations according to Theorem 4.5. Hence, we also omit the results of PathSim for these transformations from the table. According to Table 1, the rankings produced by RWR and SimRank vary considerably over relationship-reorganizing transformations. As we have shown in Section 2, PathSim is not robust under certain relationship-reorganizing transformations. We use the SNAP dataset and reorganize it to the structure of DBLP-citation as depicted in Figure 3a. For the query workload, we randomly sample 50 papers from SNAP based on their degrees. We use the meta-walk [paper, paper, paper] on SNAP and [paper, citation, paper, citation, paper] on DBLP-citation for PathSim and R-PathSim. The average ranking differences for the top 3, 5 and 10 answers of PathSim are 0.564, 0.522 and 0.495, respectively. Hence, the output of PathSim varies significantly over some relationship-reorganizing transformations.


[Figure 6: Fragments of bibliographic databases: (a) DBLP, (b) SIGMOD Record structure.]


Entity Rearrangement: We use the DBLP and WSU course databases to evaluate the robustness of RWR, SimRank, PathSim, and R-PathSim over entity-rearranging transformations. Because SimRank and RWR take too long to finish on the full DBLP dataset, we perform the following experiments using a subset of DBLP with 24,396 nodes and 98,731 edges. The FDs in the DBLP database are paper $\rightarrow$ proc, paper $\rightarrow$ area, and proc $\xrightarrow{[\mathit{proc},\,\mathit{paper},\,\mathit{area}]}$ area. We transform this database to a database that follows the structure of SIGMOD Record from sigmod.org/publications, where the information about each collection of papers is directly connected to the node that represents the collection. Figure 6b shows fragments of this database. The FDs in this database are paper $\rightarrow$ proc, proc $\rightarrow$ area, and paper $\xrightarrow{[\mathit{paper},\,\mathit{proc},\,\mathit{area}]}$ area. We call this transformation DB2SI. We randomly sample 100 proceedings based on their degrees in the DBLP dataset as our query workload. The FDs in the WSU database are offer $\rightarrow$ course, offer $\rightarrow$ subject, and course $\xrightarrow{[\mathit{course},\,\mathit{offer},\,\mathit{subject}]}$ subject. Figure 7 depicts the transformation of our WSU course dataset to the structure of the Alchemy UW-CSE database from alchemy.cs.washington.edu/data/uw-cse. We call this transformation WS2AL. The FDs in the Alchemy UW-CSE database are offer $\rightarrow$ course, offer $\xrightarrow{[\mathit{offer},\,\mathit{course},\,\mathit{subject}]}$ subject, and course $\rightarrow$ subject. We randomly sample 100 courses from WSU based on their degrees as our query workload. Table 1 shows the average ranking differences






















[Figure 7: Schemas for the course database: (a) WSU, (b) Alchemy UW-CSE.]


for the top 3, 5 and 10 answers from RWR, SimRank and PathSim under DB2SI and WS2AL. We use the meta-walks [proc, area, proc] and [proc, paper, area, paper, proc] over DBLP and SIGMOD Record, respectively, for PathSim and R-PathSim. We use the meta-walks [course, offer, subject, offer, course] and [course, subject, course] over WSU and Alchemy UW-CSE, respectively. Because R-PathSim returns the same answers over both transformations, we do not report its results. According to Table 1, the outputs of all algorithms except R-PathSim differ significantly over entity-rearranging transformations.

6.3 Efficiency and Effectiveness 

Efficiency: We evaluate the efficiency of R-PathSim and PathSim over the full IMDb and DBLP data. We transform IMDb to the Movielicious (MVL) structure, which contains both informative and non-informative walks, to evaluate the impact of detecting informative walks in R-PathSim. This results in a database of 1,272,253 nodes and 2,886,494 edges. As explained above, the DBLP dataset satisfies some FDs. Thus, we use it to measure the influence of using meta-walks with *-labels on the running time of R-PathSim. To explore the impact of both detecting informative walks and using meta-walks with *-labels on the efficiency of R-PathSim, we add a node without value, called authors, that groups the authors of the same paper in the DBLP dataset. This modification introduces non-informative walks to the database. We call the resulting database DBLP+, which contains 1,905,092 nodes and 3,370,169 edges. We precompute and store commuting matrices for meta-walks of size, i.e., number of labels, up to 3 to be used in query processing, as done in PathSim [53]. MVL, DBLP, and DBLP+ have 16, 16, and 22 meta-walks with sizes less than or equal to 3, respectively. It takes 49, 153, and 156 seconds for R-PathSim to precompute and store the commuting matrices of these meta-walks for MVL, DBLP, and DBLP+, respectively, which is reasonable for a pre-processing step. We have executed PathSim over the same datasets and obtained almost equal running times to those of R-PathSim.
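To make the role of commuting matrices concrete, the sketch below builds one for a symmetric meta-walk by multiplying label-to-label adjacency matrices and then applies the standard PathSim score $2M[x][y] / (M[x][x] + M[y][y])$. It illustrates plain PathSim only; the informative-walk filtering and *-label handling of R-PathSim are not reproduced. The toy matrices and function names are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def commuting_matrix(half_matrices):
    """Commuting matrix of the symmetric meta-walk whose first half is given.

    half_matrices: adjacency matrices along the first half of the meta-walk,
    e.g. [conf-paper, paper-dom] for [conf, paper, dom, paper, conf].
    """
    half = half_matrices[0]
    for m in half_matrices[1:]:
        half = matmul(half, m)
    return matmul(half, transpose(half))

def pathsim(m, x, y):
    """PathSim score between entities x and y from commuting matrix m."""
    denominator = m[x][x] + m[y][y]
    return 2 * m[x][y] / denominator if denominator else 0.0

# Toy data: conference 0 has papers 0 and 1, conference 1 has paper 2;
# all three papers belong to domain 0.
conf_paper = [[1, 1, 0],
              [0, 0, 1]]
paper_dom = [[1, 0],
             [1, 0],
             [1, 0]]

M = commuting_matrix([conf_paper, paper_dom])
score = pathsim(M, 0, 1)
print(M, score)
```

In this toy instance both conferences belong to the same single domain, yet the score is below 1 because the walk counts depend on how many papers each conference has.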

We randomly select 100 movies from MVL and 100 proceedings from DBLP and DBLP+, based on their degrees, and use them as our query workloads. Because R-PathSim computes scores over only informative walks, we would like to measure the time used for the extra steps of detecting and ignoring non-informative walks. Because MVL and DBLP+ contain nodes without value, there exist non-informative walks in these two databases. Thus, we compare the query processing time of R-PathSim and PathSim over MVL and DBLP+. We first find the set of all maximal meta-walks of a given size over each database. Then we run R-PathSim and PathSim using these meta-walks, and compute the average time per query per meta-walk. Table 2 shows the average query processing time of R-PathSim and PathSim per query per meta-walk given that commuting matrices up to size 3 are materialized. Overall, there is about a 4% increase in the running time of R-PathSim over PathSim due to the extra steps. Hence, the time spent on detecting and ignoring non-informative walks is almost negligible.

Next, we analyze and compare the efficiency of aggregated R-PathSim and aggregated PathSim. We use Algorithm 2 to construct the subset of maximal meta-walks over which R-PathSim computes the aggregated score. Since no algorithm is presented in [25, 24] for finding a subset of meta-walks to compute over, we find the subset of meta-walks up to a given size. Then we measure the running time of aggregated PathSim over these meta-walks.

Table 2 shows the query processing time of R-PathSim and PathSim per query, given that commuting matrices up to size 3 are materialized. The results indicate that the additional steps in R-PathSim to detect and ignore non-informative walks do not significantly increase its running time compared with that of PathSim.

Table 4 and Table 3 show the average query processing time of R-PathSim and PathSim per query per meta-walk given that commuting matrices up to size 3 are materialized. The reported processing time of R-PathSim also includes the time that Algorithm 2 takes to construct the subset of meta-walks over which R-PathSim computes the aggregated score. The set of entity labels $L$ for the input of Algorithm 2 is the set of all entity labels in the database. The results indicate that the running time of R-PathSim is reasonable for a design-time task when using R < 3 and assuming that Algorithm 2 is run in the preprocessing steps.

Effectiveness: We evaluate the effectiveness of R-PathSim over the Microsoft Academic Search (MAS) dataset. For the query workload, we randomly sample 100 conferences based on their degrees from the dataset. To provide the ground truth, given a conference q we manually group all other conferences into three categories: similar, which contains all conferences that have the same domain as q; quite-similar, which includes the conferences in the domains that are closely related to the domain of q; and least-similar, which contains conferences
         Size   PathSim   R-PathSim
MVL      5      0.029     0.033
         7      0.021     0.023
DBLP+    5      0.026     0.010
         7      0.760     0.774

Table 2: Average query time in seconds of PathSim and R-PathSim per meta-walk of a given size.



         Size   Num   Time_P   Time_Q
MVL      5      3     0.118    0.099
         7      4     0.079    0.083
         9      8     0.119    6.082
DBLP     5      5     0.075    0.127
         7      7     0.076    5.329
         9      16    0.083    12.160
DBLP+    5      3     0.090    0.029
         7      6     0.077    4.644
         9      10    0.086    7.601

Table 3: Average query time (Time_Q) in seconds of aggregated PathSim. Time_P denotes the time for finding a set of meta-walks of size up to the given size, and Num denotes the number of meta-walks found.


in the domains that are not strongly related to the domain of q. For example, the Data Mining and Databases domains are strongly related, but Databases and Computer Vision are not. We use Normalized DCG (nDCG) to compare the effectiveness of R-PathSim and PathSim because it supports multiple levels of relevance for returned answers. The value of nDCG ranges between 0 and 1, where higher values indicate a more effective ranking. We report the values of nDCG for the top 5 (nDCG@5) and top 10 (nDCG@10) answers. In the first experiment, we use the meta-walk [conf, paper, citation, paper, citation, paper, conf] to find similar conferences based on their papers' citations. Since R-PathSim considers only informative walks of this meta-walk, it returns different results than PathSim. The average nDCG@5 (nDCG@10) for R-PathSim and PathSim are .264 (.315) and .261 (.313), respectively. Although the value of nDCG for R-PathSim is higher than that of PathSim, the difference is not statistically significant according to the paired t-test at a significance level of 0.05. In the second experiment, we evaluate the effectiveness of using meta-walks with *-labels. We compute the similarities of conferences based on the keywords in their domains. PathSim uses the meta-walk [conf, paper, domain, keyword, domain, paper, conf], and R-PathSim uses the same meta-walk with paper marked as a *-label. The average nDCG@5 (nDCG@10) for R-PathSim and PathSim are 1.0 (1.0) and 0.969 (0.901), respectively. R-PathSim significantly outperforms PathSim. Entities of type paper should not play a role in computing the similarity of conferences based on the keywords of their domains. Nevertheless, PathSim considers papers in determining these similarities. Hence, it deems conferences with more papers more similar, even though they may not have that many common keywords. R-PathSim avoids this problem by treating paper as a *-label. For example, the top 5 answers of R-PathSim for the query SIGKDD are ICDM, IDEAL, PAKDD, PJW and PKDD, whereas the top 5 answers of PathSim for the same query are ICOMP, IC-AI, ICAIL, ICALP and ICANN.
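The nDCG@k measure used here can be sketched as follows. This is a minimal illustration under assumptions: the numeric gains assigned to the three categories (similar = 2, quite-similar = 1, least-similar = 0) and the logarithmic discount are one common choice and are not stated in the text; the function names are illustrative.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain of the first k gains (log2 discount)."""
    return sum(g / math.log2(position + 2)
               for position, g in enumerate(gains[:k]))

def ndcg(ranked_gains, all_gains, k):
    """nDCG@k: DCG of the returned ranking divided by the ideal DCG.

    ranked_gains: gains of the returned answers, in ranked order.
    all_gains: gains of every candidate, used to build the ideal ranking.
    """
    ideal = dcg(sorted(all_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Assumed gains: similar = 2, quite-similar = 1, least-similar = 0.
perfect = ndcg([2, 2, 1, 0, 0], [2, 2, 1, 0, 0], k=5)
swapped = ndcg([0, 2], [2, 0], k=2)
print(perfect, swapped)
```

Because the measure rewards placing higher-graded answers earlier, it can separate a ranking that returns only similar conferences from one that mixes in merely quite-similar ones, which a binary relevance measure could not.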

Next, we measure the effectiveness of aggregated R-PathSim over a set of meta-walks found by Algorithm 2. We generate a set of meta-walks over MAS using Algorithm 2 by giving the set of all entity labels in the dataset as an input and setting parameter R to 1, 2 and 3. We compute the aggregated R-PathSim score over these meta-walks using the same query workload. The average nDCG@5 (nDCG@10) for R-PathSim using R equal to 1, 2 and 3 are 1.0 (1.0), 0.976 (0.932) and 0.936 (0.844), respectively. To analyze our effectiveness results, we also computed aggregated PathSim over the set of all meta-walks of size up to 5, 7 and 9 in the MAS database using the same query workload. The average nDCG@5 (nDCG@10) for PathSim over the set of all meta-walks of size up to 5, 7 and 9 are 0.969 (0.901), 0.943 (0.852) and 0.933 (0.820), respectively. The results of R-PathSim using R equal to 1 are significantly better than the results of PathSim computed over meta-walks of size up to 5 and 7. The results of R-PathSim using R equal to 2 are significantly better than the results of PathSim




         R   Size   Num   Time_P   Time_Q
DBLP     1   5      4     0.273    0.145
         2   9      8     0.078    1.829
         3   13     12    1.214    60.975
DBLP+    1   7      4     4.103    0.858
         2   15     8     0.516    2.037
         3   23     12    1.594    58.811

Table 4: Average query time (Time_Q) in seconds of aggregated R-PathSim over DBLP and DBLP+. Time_P denotes the running time of Algorithm 2 using all entity labels as the input and parameter R. Size and Num denote the maximum size and total number of maximal meta-walks found by the algorithm.
computed over meta-walks of size up to 7. There is no significant difference between any other results of PathSim and R-PathSim.

7 Conclusion 

We postulated that a similarity search algorithm should return essentially the same answers for the same query over different representations of a database. We introduced two families of frequently occurring representational shifts over graph databases called relationship-reorganizing and entity-rearranging transformations. We showed that current well-known similarity search algorithms are not representation independent and proposed new algorithms that are representation independent under these transformations.